Abstract


In  many  machine  learning  problems  and  application  domains,  the  data  are  naturally 
organized  by  groups.  For  example,  a  video  sequence  is  a  group  of  images,  an  image  is 
a  group  of  patches,  a  document  is  a  group  of  paragraphs/words,  and  a  community  is  a 
group  of  people.  We  call  them  the  collective  data. 

In this thesis, we study how and what we can learn from collective data. Usually,
machine  learning  focuses  on  individual  objects,  each  of  which  is  described  by  a  feature 
vector  and  studied  as  a  point  in  some  metric  space.  When  approaching  collective  data, 
researchers  often  reduce  the  groups  into  vectors  to  which  traditional  methods  can  be 
applied.  We,  on  the  other  hand,  will  try  to  develop  machine  learning  methods  that 
respect  the  collective  nature  of  data  and  learn  from  them  directly. 

Several different approaches are taken to address this learning problem. When
the groups consist of unordered discrete data points, each group can naturally be characterized
by its sufficient statistic, the histogram. For this case we develop efficient methods
to address the outliers and temporal effects in the data based on matrix and tensor
factorization.

To learn from groups that contain multi-dimensional real-valued vectors, we develop
both generative methods based on hierarchical probabilistic models and discriminative
methods using group kernels based on new divergence estimators. With these
tools, we can accomplish various tasks such as classification, regression, clustering,
anomaly detection, and dimensionality reduction on collective data.

We further consider the practical side of the divergence-based algorithms. To reduce
their time and space requirements, we evaluate and find methods that can effectively
reduce the size of the groups with little impact on the accuracy. We also propose
the conditional divergence along with an efficient estimator in order to correct
the sampling biases that might be present in the data. Finally, we develop methods
to learn in cases where some divergences are missing, caused by either insufficient
computational resources or extreme sampling biases.

In addition to designing new learning methods, we use them to help the scientific
discovery process. In our collaboration with astronomers and physicists, we see
that the new techniques can indeed help scientists make the best of data.
Chapter 1
Introduction 


The current machine learning paradigms mostly focus on individual objects that have simple
representations. For example, in the tasks of classification, regression, and clustering, an object of
interest is often described by a “feature vector” and abstracted as a point in some metric space.
The objective of learning is to estimate functions that map these points to target variables such
as class labels or cluster memberships, with the goal of achieving both empirical accuracy
on the training data and generalization power on unseen data. This “one point per object”
abstraction has led to very concise representations, elegant mathematical theories, and very
successful algorithms.

Nevertheless, many of the interesting data in the real world can and should
be treated as collections of constituent items. For instance, in the field of language modeling and
text processing, an article can be considered as a group of paragraphs or sections, and further a
paragraph is a group of words. In computer vision and image processing, a prevailing assumption
is that a visual scene consists of a group of local image patches. In recommendation systems, a
user is mainly described by the group of products he/she bought. In a social network, a community
is a group of people. In these problems, the actual abstraction should be “one set of points per
object”. We call data that are organized by groups collective data. In the
following, we shall call the basic constituent entities “points”, and the aggregations of points
“groups” (or equivalently “sets” or “bags”).

The task of learning from collective data arises in many application domains, yet our research
is largely motivated by the demands from the scientific community. Due to the advancement of
sensory systems and ever-increasing computation power, scientists are now facing data
at unprecedented scales. For example, in astronomy, modern telescope pipelines like the Sloan
Digital Sky Survey¹ (SDSS) can produce observations for a vast number of celestial objects. In
physics, large-scale simulation systems such as the JHU Turbulence Database Clusters² (JTDC)
were implemented to study the dynamics of fluid and particles. In these problems, we have huge
amounts of collective data that are impossible for experts to examine manually. Therefore, computational
assistance is needed.

¹ http://www.sdss.org

² http://turbulence.pha.jhu.edu
In this thesis, we try to answer the question: how and what can we learn from collective data?
We emphasize that it is important to look beyond the point-level behaviors of data when designing
learning algorithms for collective data. Consider the task of novelty detection on an article
that is represented as a bag (group) of words (points). While paragraphs talking about either
“machine learning” or “gummy bears” are not novel on their own, an article containing both of
the terms might be interesting. In computer vision, it is unreliable to classify a scene image just
by the presence of a single object (e.g. we cannot say with certainty that an image is a city scene
if it contains a building; it might also be a beach scene). Instead, we should consider the overall
composition of different objects in this image.

The most straightforward method, and indeed the most widely used one, to learn from collective data
is to treat the groups as single objects and transform them into vectors, so that traditional point-wise
learning techniques can be applied. However, in many situations this conversion is essentially
a feature engineering process that can be domain specific and difficult. Moreover, it is likely that
some useful information is lost during the conversion. Therefore, we aim at developing learning
methods that inherently respect the collective nature of data and can learn from them directly.

We investigate several approaches to learn from different types of collective data. The first type
of collective data consists of groups of one-dimensional discrete points that are modeled by categorical
random variables. This kind of data is abundant in text processing and recommendation systems,
where each word or item is considered a discrete symbol. It is popular to assume
that the points are exchangeable, i.e. the order of the points within a group carries no information,
so that we can summarize the points in a group by the histogram, which is a sufficient statistic
for any conceivable statistical method. This approach provides a natural way of converting the
groups into vectors, to which many existing learning methods can be applied. In this work, we
learn from these data within the matrix factorization framework, where the matrices are formed
by stacking the vector representations of the groups. We develop two algorithms. The first one
introduces an extra “time” dimension to model the temporal effects of data using tensors. The
second one is a robust method that enables reliable factorization in the presence of outliers, and we
use it for anomaly detection.

In more general and more common data sets, the groups contain points that are continuous,
multidimensional vectors. Intuitively, each group is a point cloud in a multidimensional space. In
this case, we no longer have a natural way of reducing the groups to concise vector representations,
which makes the learning tasks more challenging. To attack this problem, researchers
often “encode” each vector by a discrete point, and then use the approach described in the previous
paragraph. Indeed, “encoding” itself is a big and interesting topic. But even though many
sophisticated algorithms have been proposed in recent years, a significant amount of domain
knowledge and human effort may be required depending on the specific problem.

We, on the other hand, try to learn from these groups directly without any conversion. Both
generative and discriminative methods are studied. From the generative perspective, we can model
the generating process of the groups and points, and then use the insights from the models to
help us accomplish other learning tasks. Again assuming the exchangeability of points, we devise
models based on topic modeling that can capture the multi-level characteristics of the groups, and
then use them for group anomaly detection, clustering, and classification. These models consist of
different probabilistic components to achieve a balance between flexibility, speed, and robustness.

Discriminative methods that learn from collective data based on similarity
or dissimilarity measures between groups are also considered. Assuming that the points within a group are
i.i.d. samples from some underlying distribution, we construct novel estimators of kernels between
the groups based on a class of new nonparametric divergence estimation methods. These kernel estimators
are provably consistent and efficient to compute. With them, we can take advantage of
existing methods that depend solely on similarities (e.g. support vector machines (SVM) and
spectral clustering [122]) to accomplish various learning tasks including classification, regression,
clustering, dimensionality reduction, and anomaly detection. In our experiments on both synthetic
and real-world data sets, these new methods have achieved state-of-the-art performance.

These methods are further enhanced to cope with challenges we might face in real data sets. To
increase the speed of the algorithms, we examine possible ways of reducing the size of the groups
while preserving the learning performance. We also investigate the possibility of accomplishing
learning tasks using only partial similarity measures so that fewer group kernel computations are
needed. Finally, sampling biases in collective data are considered, and we propose a novel divergence
between groups to solve the problem.

We want our research to be truly useful. In addition to designing new machine learning methods,
we develop practical tools to assist scientific researchers. We analyze the real-time astronomy
data generated by SDSS using the algorithms we developed. A website is built to present
interesting celestial objects to the astronomers and to collect their feedback. Similar tools could
be used to assist physicists in discovering and studying novel phenomena in large-scale turbulence
simulations.

This thesis contains several of our published works:

•  Chapter  2:  [172] 

•  Chapter  3:  [103] 

•  Chapter  4:  [173,  174] 

•  Chapter  5:  [130,  131] 

•  Chapter  7:  [175] 

The rest of this chapter is organized as follows. In Section 1.1, we describe the notation
that we use throughout this thesis. Sections 1.2 and 1.3 introduce the background and literature on
learning from discrete and continuous collective data, respectively. Finally, in Section 1.6 we state
the purpose of our research and give an overview of the structure of this thesis.

1.1  Notations 

First, we list some symbols in Table 1.1 that we shall use throughout this thesis. The common
format of the data sets in this thesis is as follows. We consider a data set that consists of $M$ groups
of points $\{G_m\}_{m=1,\dots,M}$, and each group $G_m$ is a set of $N_m$ points, $G_m = \{x_{mn}\}_{n=1,\dots,N_m}$.

The  numbers  of  points  in  each  group  can  be  different.  Note  that  we  only  consider  cases  where 
the  groups  are  pre-defined,  i.e.  we  are  not  addressing  the  clustering/partition  problems  of  how  to 
divide  the  points  into  groups. 
Table 1.1: Symbols

| Symbol | Definition and Description |
|---|---|
| $M$ | The number of groups. |
| $N_m$ | The number of points in group $m$. |
| $D$ | For discrete data, $D$ is the total number of categorical values. For continuous data, $D$ is the dimensionality of the points. |
| $x_{mn}$ | The $n$th point in group $m$. For discrete data, $x_{mn}$ is a categorical variable taking values from $\{1, \dots, D\}$. For continuous data, $x_{mn} \in \mathbb{R}^D$. |
| $G_m$ | Group/Set/Bag $m$. $G_m = \{x_{m,1}, \dots, x_{m,N_m}\}$ contains the set of points in group $m$. |
| $y_m$ | The label of $G_m$. In classification problems $y_m$ is the class label, while in clustering problems $y_m$ is the cluster membership. |
| $f_m$ | If we assume that the points in $G_m$ are random samples, then $f_m$ is the distribution that generates these samples. Usually $f_m$ is not observed. |
| $g_m$ | The vector representation of group $G_m$ for discrete data. $g_m \in \mathbb{R}^D$ is usually the histogram of points in $G_m$. |
| $X$ | The data matrix for discrete data. $X = [g_1, \dots, g_M]^T \in \mathbb{R}^{M \times D}$ is constructed by stacking the $g_m$’s. |
| $NN_G(x)$ | The nearest neighbor of point $x$ in group $G$. |
| $\mathbb{S}^K$ | The $K$-dimensional probability simplex. |
| $I$ | The identity matrix. |
| $I(c)$ | The indicator function. $I(\text{true}) = 1$ and $I(\text{false}) = 0$. |
| $\odot$ | The element-wise multiplication between vectors or matrices of the same size. |
| A V o | The inner and outer products of vectors. |

The points in the groups can be either discrete or continuous and multidimensional. For discrete
points, $x_{mn}$ is a categorical variable taking values from $\{1, \dots, D\}$, where $D$ is the number of
possible values. For continuous data, $x_{mn} \in \mathbb{R}^D$ is a $D$-dimensional vector, where $D$ is the
dimensionality. For example, in text modeling, we can consider a document as a group of words.
In this case, $G_m$ is a document, $x_{mn}$ is the $n$th word in $G_m$, and $x_{mn}$ can only take one of the
values/words from the vocabulary. In computer vision, we can consider an image as a group of
patches. In this case, $G_m$ is an image and $x_{mn}$ is the $D$-dimensional feature vector extracted
to describe the $n$th patch in $G_m$.

We often assume that the points in $G_m$ are random samples from some distribution $f_m$. Usually
$f_m$ is not observed and has to be either inferred or estimated. For instance, in text modeling, under
the bag-of-words (BoW) assumption, which ignores the information carried by the order of the words,
we can say that a document $G_m$ has an underlying multinomial word distribution $f_m$, and the
document is realized by randomly sampling points/words from $f_m$. Similar examples can also be
found in continuous data sets based on appropriate assumptions.
In classification, the objects of interest are associated with class labels. We use ym to denote
the  class  label  of  group  Gm.  Note  that  we  care  only  about  learning  on  groups,  therefore  the  class 
labels  (as  well  as  cluster  memberships  and  other  learning  output)  are  only  for  groups  and  not  for 
points.  This  is  different  from  multiple  instance  learning  [184],  where  the  data  are  organized  by 
groups  but  the  learning  is  on  points;  see  Section  1.4  for  more  details. 

To denote sub-matrices and sub-vectors, we use the Matlab® notation. For example, $X_{1:100,:}$
denotes the first 100 rows of the matrix $X$.

Nearest neighbors (NN) are also frequently used. We use $NN_G(x)$ to denote the NN of point $x$
in group $G$. If $x$ is in $G$, then it excludes itself during the search. Ties, if any, are broken arbitrarily.
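The nearest-neighbor convention above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and a brute-force search; the function name `nearest_neighbor` and the `exclude_index` argument are hypothetical and are not part of the thesis.

```python
import numpy as np

def nearest_neighbor(x, G, exclude_index=None):
    """Return the index and the point of the nearest neighbor of x in group G.

    G is an (N, D) array. If x is itself a member of G, pass its row index as
    exclude_index so that it is skipped during the search.
    """
    G = np.asarray(G, dtype=float)
    dists = np.linalg.norm(G - np.asarray(x, dtype=float), axis=1)
    if exclude_index is not None:
        dists[exclude_index] = np.inf  # a point is never its own neighbor
    idx = int(np.argmin(dists))        # ties: the first minimum is returned
    return idx, G[idx]

G = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
idx_self, _ = nearest_neighbor(G[0], G, exclude_index=0)  # query is a member of G
idx_query, _ = nearest_neighbor([2.9, 3.9], G)            # query is outside G
```

Taking the first minimum is just one arbitrary way to break ties, which is all the convention requires.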

1.2  Learning  from  Discrete  Data 

Many collective data sets contain discrete/categorical points. For example, in text processing, documents
comprise discrete words that take values from a vocabulary. In recommendation problems,
a user is characterized by the set of items he/she bought, which are discrete symbols.

If we assume that these discrete points are infinitely exchangeable, i.e. the order of the points
does not affect the nature of the groups, then we can succinctly represent the group by its sufficient
statistic: the histogram. Let a group of points be $G = \{x_1, \dots, x_N\}$ with $x_n \in \{1, \dots, D\}$.
Then $G$ can be represented by a histogram

$$g = \left( \sum_{n=1}^{N} I(x_n = 1), \dots, \sum_{n=1}^{N} I(x_n = D) \right) \in \mathbb{R}^D.$$

This  approach  reduces  groups  to  vectors  for  which  we  have  mature  analysis  tools  and  learning 
techniques.  Note  that  no  information  is  lost  during  this  reduction  process  under  the  exchangeability 
assumption. 
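As a small illustration of the histogram reduction, the sketch below counts the occurrences of each value in a group and checks that reordering the points leaves the result unchanged. The function name `group_histogram` is hypothetical, and values are assumed to be integers in 1..D.

```python
import numpy as np

def group_histogram(points, D):
    """Count how often each value 1..D occurs in a group of discrete points."""
    points = np.asarray(points, dtype=int)
    # np.bincount counts values 0..D; drop the unused bin for value 0.
    return np.bincount(points, minlength=D + 1)[1:]

G = [3, 1, 3, 2, 3]                          # a group of five points with D = 4
g = group_histogram(G, D=4)                  # counts for the values 1, 2, 3, 4
g_shuffled = group_histogram(G[::-1], D=4)   # same points in reverse order
```

Because the histogram ignores order, `g` and `g_shuffled` are identical, which is the sense in which nothing is lost under exchangeability.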

In some problems, even when the points are not discrete, researchers still discretize them
using techniques like vector quantization [59] or sparse coding [94, 181], and then use aggregation
methods that are similar to histogram reduction. A well-known example is the bag-of-words
representation used in image processing and computer vision (e.g. [19, 50, 136]). Inevitably,
information carried by the original data might be compromised during this reduction. Nevertheless,
the resulting vectorial representation is compact and familiar, and researchers can still develop
good learning methods based on this reduced representation. In fact, the problem of how to effectively
discretize the points is a field of its own, but it is outside our focus.

Nevertheless, effective algorithms to learn from groups with discrete points are very important,
and they can also be applied to many traditional problems with individual vectorial points. In the
later part of this thesis, we develop algorithms that can learn from collective data directly without
discretization or other conversions.

1.2.1  Factorization  Methods 

In  this  research,  the  learning  of  discrete  collective  data  that  can  be  converted  into  vectors  is  done 
under  the  factorization  framework  for  its  simplicity,  speed,  and  wide  applicability. 

Matrices are very useful in representing data. Vectors can be stacked to form matrices such
as the design matrices in regression and classification, and the document-word matrix in language
modeling and text processing. Matrices are also used to describe networks and graphs, as well as
preference data in collaborative filtering [85, 138].

We denote the data matrix as $X \in \mathbb{R}^{M \times D}$, where each row represents a group and each
column represents one discrete value. One of the most common analyses for $X$ is factorization/decomposition,
such as principal component analysis (PCA). For design matrices, PCA
reveals the intrinsic linear structure of data. For text data, latent semantic indexing (LSI) [72] and
non-negative matrix factorization (NMF) [43] are often applied. The low-rank assumption is also
useful in matrix completion [23, 111] and collaborative filtering [138, 141].

To do a low-rank factorization, we assume that $X$ has a low rank $K$ and decomposes as

$$X \approx U V^T, \quad U \in \mathbb{R}^{M \times K}, \; V \in \mathbb{R}^{D \times K}. \tag{1.1}$$

There are several ways to interpret the factor matrices $U$ and $V$. They can be thought of as the
small number of bases and coefficients to reconstruct the matrix. They can also be interpreted as
the low-dimensional latent factors/features for the rows and columns of the matrix $X$, such that the
inner products of the factors approximate the matrix’s entries. In terms of our learning problem,
the bases interpretation means that each group can be approximated by a linear combination of
“basis groups”; the factor interpretation means that each group and discrete value have a latent
feature, and the “compatibility” between the features of a group and a value determines how often
that value appears in the group.
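To make the low-rank decomposition concrete, the sketch below builds factors of the stated shapes from a truncated singular value decomposition. The SVD is used here only as one standard way to obtain a rank-K approximation and is an assumption of this example, not necessarily the algorithm developed in the thesis; the function name `low_rank_factorize` is hypothetical.

```python
import numpy as np

def low_rank_factorize(X, K):
    """Return U (M x K) and V (D x K) such that U @ V.T is the best rank-K
    approximation of X in the least-squares sense."""
    U_full, s, Vt = np.linalg.svd(X, full_matrices=False)
    root = np.sqrt(s[:K])          # split each singular value between U and V
    U = U_full[:, :K] * root
    V = Vt[:K, :].T * root
    return U, V

rng = np.random.default_rng(0)
M, D, K = 6, 5, 2
X = rng.random((M, K)) @ rng.random((K, D))   # a matrix of rank at most K
U, V = low_rank_factorize(X, K)
error = np.linalg.norm(X - U @ V.T)           # close to zero for this X
```

Each entry of `U @ V.T` is the inner product of one row of `U` (a group's latent feature) and one row of `V` (a value's latent feature), which matches the factor interpretation.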

1.2.2  Temporal  Modeling 

Successful as they are, most existing factorization algorithms are limited in that they are static
models in which groups are assumed to be stationary over time. However, real data often evolve
over time and exhibit strong temporal patterns, and traditional static methods are incapable
of learning the shift of a group’s composition of points. In other words, a group’s properties and
behavior may change over time, but it can only be represented by one fixed latent factor in traditional
factorization models.

Another outstanding problem for many data sets is that they are often very sparse. For example,
a document only contains a very small set of words compared to the whole vocabulary. Taking
the Netflix³ data for instance, there are 17,770 movies and 480,189 users, but only 99,072,112
training ratings. This means that we are learning from a matrix with only 1.16% of its entries
given. This phenomenon presents two challenges for us. The first one is how to avoid over-fitting,
and the second is how to take advantage of this sparsity to accelerate computation.
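The sparsity figure quoted for the Netflix data follows directly from the three counts in the text; a short check:

```python
n_movies = 17_770
n_users = 480_189
n_ratings = 99_072_112

# Fraction of the movie-by-user matrix whose entries are observed.
density = n_ratings / (n_movies * n_users)
percent_observed = 100 * density   # about 1.16
```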

³ http://www.netflixprize.com/

To solve these problems, we propose a factorization method that is able to model time-evolving
data. In addition to the factors that are used to characterize groups and values, we introduce another
set of latent factors for time itself. Intuitively, these additional factors represent the population-level
modulation of latent features at each particular time. This kind of modeling allows us to introduce
flexibility into the time dimension without further sparsifying the data, which would happen if we
were to estimate different models at different times. We further enhance the method with Bayesian
techniques to avoid overfitting. Finally, the new algorithm is not much slower than
static methods.
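One way to picture "latent factors for time" is a three-way product in which a time factor rescales each latent feature before the group and value factors are combined. The sketch below assumes this particular form, with entries $\sum_k U_{mk} V_{dk} T_{tk}$; it is an illustration of the idea only, and the exact model is the one defined later in the thesis. The names `reconstruct_over_time` and `T` are hypothetical.

```python
import numpy as np

def reconstruct_over_time(U, V, T):
    """Return an (M, D, n_times) tensor with entries sum_k U[m,k] * V[d,k] * T[t,k]."""
    return np.einsum("mk,dk,tk->mdt", U, V, T)

rng = np.random.default_rng(0)
M, D, K, n_times = 4, 3, 2, 5
U = rng.random((M, K))        # latent factors of the groups
V = rng.random((D, K))        # latent factors of the discrete values
T = np.ones((n_times, K))     # time factors; all ones means no temporal modulation
T[3] = [2.0, 0.5]             # at time 3, feature 0 is amplified and feature 1 damped

X_hat = reconstruct_over_time(U, V, T)
```

With all-ones time factors each time slice reduces to the static model `U @ V.T`, and the single set of group and value factors is shared across time, so no separate model has to be fitted per time step.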

1.2.3  Robust  Factorization 

Another  aspect  of  data  we  are  interested  in  is  the  outliers/anomalies.  Real-world  problems  almost 
always  involve  anomalies  or  outliers  that  do  not  conform  to  our  assumptions.  They  can  severely 
degrade  the  models’  quality,  or  lead  to  novel  discoveries.  Thus  we  want  robust  methods  that  can 
produce  high-quality  models  as  well  as  find  the  outliers.  The  definition  of  outlier  varies  depending 
on  specific  problems,  but  in  general  outliers  lie  in  the  low-density  regions  of  data  distributions. 
[26]  surveyed  outlier  detection  problems.  In  our  factorization  work,  we  consider  subspace  outliers 
and  assume  that  normal  data  reside  in  a  low-dimensional  linear  subspace  (the  row/column  space 
of  the  low-rank  matrix).  For  instance  in  signal  processing,  a  normal  signal  can  be  reconstructed  by 
a  few  bases.  If  a  signal  cannot  be  well  reconstructed,  it  is  an  outlier. 
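The notion of a subspace outlier can be sketched as a reconstruction-error test: project each data vector onto the subspace of normal data and measure what is left over. The example assumes the subspace is already known and given by orthonormal basis vectors; the function name `reconstruction_errors` is hypothetical.

```python
import numpy as np

def reconstruction_errors(X, basis):
    """Distance from each row of X to the subspace spanned by the
    orthonormal rows of basis."""
    projection = X @ basis.T @ basis
    return np.linalg.norm(X - projection, axis=1)

# Normal data live in the x-y plane of a 3-D space.
basis = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
X = np.array([[1.0, 2.0, 0.0],
              [-3.0, 0.5, 0.0],
              [1.0, 1.0, 5.0]])   # the last row leaves the plane
errors = reconstruction_errors(X, basis)
```

The first two rows are reconstructed exactly, while the third has an error of 5 and would be flagged as an outlier under this criterion.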

Factorization methods are often not robust due to the $L_2$-norm used to measure approximation
errors [86, 176]. Many robust estimators have been proposed (e.g. [67, 79, 86, 88, 99]). A common
approach is to replace the $L_2$-norm with robust norms that are insensitive to outliers. For example, the $L_1$-norm
is widely used for robustness [15, 22]. Other measures like the Huber loss [74] and the
Geman-McClure function have also been employed [88, 123]. Another strategy is to exclude the
outliers: we first guess which data are outliers, and then reduce their influence [86, 176].

In this thesis, we take the approach of using robust norms. Specifically, we use the $L_0$-norm,
which counts the number of outliers disregarding their magnitudes, to replace the $L_2$-norm to
measure the errors. The resulting algorithm is simple, fast, and effective. It can also be shown that
the recently popular algorithms using the $L_1$-norm and the nuclear norm [22, 177] are relaxations
of this method.
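The difference between the three error measures is easiest to see on a residual matrix with a single corrupted entry. In the sketch below, the helper names and the tolerance `tol` used to decide whether an entry counts as nonzero are assumptions of this example.

```python
import numpy as np

def l2_error(E):
    """Sum of squared residuals."""
    return float(np.sum(E ** 2))

def l1_error(E):
    """Sum of absolute residuals."""
    return float(np.sum(np.abs(E)))

def l0_error(E, tol=1e-8):
    """Number of residual entries that are (numerically) nonzero."""
    return int(np.sum(np.abs(E) > tol))

E_small = np.zeros((3, 3))
E_small[0, 0] = 1.0       # one mild error
E_large = np.zeros((3, 3))
E_large[0, 0] = 100.0     # the same entry, grossly corrupted
```

Making the corruption 100 times larger multiplies the squared error by 10,000 and the absolute error by 100, but leaves the count unchanged at 1, which is why a count-based measure is not dominated by a few gross outliers.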


1.3  Learning  from  Continuous,  Multidimensional  Data 

In many other problems, points are multi-dimensional vectors with continuous values. In this case,
it is not easy to summarize a group by a vector or to obtain sufficient statistics. Currently, learning from these
data is often done by discretization. However, this conversion step may lose valuable information
and might need significant domain knowledge. Therefore, we want to attack this problem directly.
Our basic assumption is that the points in group $G_m$ are infinitely exchangeable samples from an
underlying probability distribution $f_m$. To learn from the groups, we learn the $f_m$’s.

1.3.1  Generative  Models 

To  motivate  the  characterization  of  groups  we  consider  the  problem  of  finding  group  anomalies. 
We  consider  two  types  of  group  anomalies.  A  point-based  group  anomaly  is  a  group  that  contains 
individually  anomalous  points  e.g.  an  image  containing  a  two-headed  wolf.  A  distribution-based 
anomaly  is  a  group  where  the  points  are  relatively  normal,  but  as  a  whole  they  are  unusual  e.g.  an 
image  containing  a  pack  of  wolves  and  a  flock  of  sheep  together. 
Most existing work on group anomaly detection focuses on point-based anomalies. These methods first
identify anomalous points and then find their aggregations. Clearly this paradigm will not work
for distribution-based anomalies. One solution is to design problem-specific features for groups.
However, it relies on feature engineering that is domain specific and can be difficult.

In  this  thesis,  we  design  several  probabilistic  models  to  capture  the  generating  process  of  the 
collective  data.  By  training  these  models,  we  can  learn  the  “normality”  of  the  data,  and  hence 
detect  unusual  behaviors.  We  can  also  infer  the  important  latent  attributes  of  the  groups  that  can 
help  us  find  both  types  of  anomalies.  The  tools  we  use  are  developed  based  on  topic  modeling. 


1.3.2  Topic  models 

We can learn the generating process of the groups using probabilistic models. For this purpose,
topic models are particularly useful, among which probabilistic latent semantic analysis
(PLSA) [72] and latent Dirichlet allocation (LDA) [14] are the most well-known.

Topic models were originally proposed for text modeling, where we have words as points and
documents as groups. They are hierarchical mixture models built upon the assumption of exchangeable
points, i.e. the order of points/words does not matter. Essentially, in LDA, a group $G_m$ is modeled
by a mixture density $f_m = \sum_{k=1}^{K} \theta_{mk} \beta_k$, where $K$ is the number of mixture components.
We call the mixture components $\beta_k$’s the topics and the mixing weights $\theta_m \in \mathbb{S}^K$ the topic
weights. LDA forces all groups to share the same topics $\{\beta_k\}_{k=1}^{K}$ so as to share information and
enhance the statistical power.

Topic models are often described by generative schemes. For example, an LDA model with
topics $\{\beta_k\}_{k=1}^{K}$ and prior Dirichlet topic weight distribution $\mathrm{Dir}(\alpha)$ can be described by Algorithm
1, and the resulting complete likelihood is

$$p(G_m, \theta_m, z_m \mid \alpha, \beta) = \mathrm{Dir}(\theta_m \mid \alpha) \prod_{n=1}^{N_m} \mathcal{M}(z_{mn} \mid \theta_m) \, \beta_{z_{mn}}(x_{mn}).$$


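Algorithm 1 The generative process of LDA.

For group $m = 1$ to $M$:

1. Choose the topic weight $\theta_m \in \mathbb{S}^K$, $\theta_m \sim \mathrm{Dir}(\alpha)$.

2. For points $n = 1$ to $N_m$:

(a) Choose a topic $z_{mn} \sim \mathcal{M}(\theta_m)$, $z_{mn} \in \{1, \dots, K\}$.

(b) Generate a point $x_{mn} \sim \beta_{z_{mn}}$.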
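The generative process of Algorithm 1 can be simulated in a few lines for discrete points. This sketch assumes that each topic is a categorical distribution stored as a row of a K-by-D matrix `beta`; the function name `sample_lda_group` is hypothetical, and topics are indexed from 0 in the code rather than from 1 as in the text.

```python
import numpy as np

def sample_lda_group(alpha, beta, n_points, rng):
    """Sample one group: topic weights theta, topic assignments z, and points x."""
    K, D = beta.shape
    theta = rng.dirichlet(alpha)                 # step 1: topic weights on the simplex
    z = rng.choice(K, size=n_points, p=theta)    # step 2(a): one topic per point
    # step 2(b): draw each point from its topic; values are reported in 1..D
    x = np.array([rng.choice(D, p=beta[k]) for k in z]) + 1
    return theta, z, x

rng = np.random.default_rng(0)
alpha = np.array([1.0, 1.0])
# Two degenerate topics: topic 0 always emits value 1, topic 1 always emits value 3.
beta = np.array([[1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]])
theta, z, x = sample_lda_group(alpha, beta, n_points=20, rng=rng)
```

With these degenerate topics every point reveals its topic assignment, so the proportion of 1s and 3s in `x` directly reflects the sampled topic weights `theta` of the group.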

Although topic models were proposed for discrete data like text, it is straightforward to use them
for continuous multi-dimensional points by using multivariate distributions such as Gaussians as
the topics $\{\beta_k\}$. The idea stays the same: we approximate the underlying distribution $f_m$ with a
mixture model $\sum_{k=1}^{K} \theta_{mk} \beta_k$, and try to figure out how this mixture model was generated.
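For continuous points, the group density is then a weighted sum of Gaussian densities. The sketch below evaluates such a mixture at a point; the parameter names `means` and `covs` and both function names are hypothetical, and the Gaussian density is written out with NumPy only.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian with the given mean and covariance at x."""
    d = len(mean)
    diff = np.asarray(x, dtype=float) - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def group_density(x, theta, means, covs):
    """Mixture density: sum over topics k of theta[k] * N(x; means[k], covs[k])."""
    return sum(w * gaussian_pdf(x, mu, S) for w, mu, S in zip(theta, means, covs))

theta = np.array([0.5, 0.5])                              # topic weights of one group
means = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]    # two well-separated topics
covs = [np.eye(2), np.eye(2)]
value = group_density([0.0, 0.0], theta, means, covs)
```

At the first topic's mean the second topic contributes almost nothing, so the density is essentially half the peak height of a standard 2-D Gaussian, that is, about 0.5 / (2π).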
1.3.3 Enhanced Topic Models

The challenge we face in using generative models for group anomaly detection is how to devise
flexible models to fully characterize the data. In traditional topic models, the model parameters,
such as the topic weight distribution $\mathrm{Dir}(\alpha)$, are considered “priors”, and their role is to incorporate
prior knowledge to assist the estimation of latent variables. In our problem, however, we need
the model parameters to describe the detailed behaviors of data in order to differentiate what is
normal from what is not.

Since LDA, various improvements have been proposed. Many of them enhance the flexibility
of the generating mechanism of topic distributions [50, 80] or capture correlations between topics [13,
102]. On the other hand, [45] allow the topics to vary for different groups in order to account for the
burstiness of words. These ideas are helpful ingredients for creating a model that can thoroughly
capture how groups are generated.

We propose several enhanced topic models for group anomaly detection. We study the
models’ flexibility, robustness, and learning and inference speed, and present our findings. Based
on these models, we propose several novel scoring functions to detect both the point-based and
distribution-based anomalies.

1.3.4  Discriminative  Methods 

Discriminative methods can also be used to learn from collective data. Here we circumvent the
need to model the generating process of the data and aim directly at what we want to learn, e.g. the class label
of a group. We focus on learning methods that are based on pairwise (dis)similarity measures,
such as the SVM.

Several methods have been proposed to measure the similarity between sets of vectors. [171]
used several traditional set distances such as the Hausdorff distance for this purpose. [63, 64]
proposed the pyramid matching kernels between vector sets based on hierarchical approximate
matching of points. [170] measures group similarities based on the angles between the subspaces
spanned by the points from different groups. [151] proposed algebraic kernels between matrices
that represent sets of vectors.

[65, 155] embed probability distributions into reproducing kernel Hilbert spaces (RKHS).
In these methods, a density $f$ is mapped to a mean function $\mu_f$ in an RKHS $\mathcal{H}_k$ induced by a kernel
$k(\cdot, \cdot)$ as $\mu_f(\cdot) = \mathbb{E}_{x \sim f}[k(x, \cdot)]$. Then the inner product between densities is just the inner product
of the mean functions in $\mathcal{H}_k$, which for empirical samples equals the average kernel value over all
inter-group point pairs. The discrepancy between densities is defined as the distance between the
mean functions in $\mathcal{H}_k$.
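
To make the mean-embedding construction concrete, the following minimal sketch computes the empirical inner product between two groups and the resulting squared distance between their mean functions. It assumes a Gaussian RBF kernel with a hand-picked bandwidth; the kernel choice, the bandwidth, and all function names are assumptions made for illustration and are not taken from [65, 155].

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel matrix between the rows of X and the rows of Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def embedding_inner(X, Y, bandwidth=1.0):
    """Inner product of the empirical mean embeddings: the average
    kernel value over all inter-group point pairs."""
    return rbf_kernel(X, Y, bandwidth).mean()

def embedding_distance_sq(X, Y, bandwidth=1.0):
    """Squared RKHS distance between the empirical mean embeddings
    (the biased estimate, which keeps the diagonal terms)."""
    return (embedding_inner(X, X, bandwidth)
            + embedding_inner(Y, Y, bandwidth)
            - 2.0 * embedding_inner(X, Y, bandwidth))

rng = np.random.default_rng(0)
group_a = rng.standard_normal((50, 2))
group_b = group_a + 3.0   # the same points shifted away from group_a
```

The distance is zero when a group is compared with itself and grows as the two groups move apart. Each evaluation costs time proportional to the product of the two group sizes.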

Alternatively, we can measure the divergences between the underlying distributions $f_m$. In
statistics, this question can be answered by the results from two-sample tests (e.g. the probability
of rejecting the null hypothesis) such as the Student t-test, the Kolmogorov-Smirnov test, and the
permutation test. However, these methods either rely on parametric assumptions, use only limited
statistics, or have difficulties in high dimensions.

Another approach is to first estimate the densities $f_m$ and then measure similarities. [140]
compute divergences by discretizing the continuous densities. [75] defines Fisher kernels between
parametric densities. [119] fits Gaussian mixture models (GMM) to compute the Kullback-Leibler
(KL) divergences. [76] fits exponential family densities, and then computes product kernels between
these densities in an RKHS. [42] defined a kernel on the level-sets of fitted densities. The problem
with these methods is that density estimation is itself notoriously difficult and parametric methods
often introduce biases.

We take the distributional approach in this work. Based on the work of [129], we propose
a nonparametric method to estimate a family of kernels between distributions based on observed
samples, while avoiding explicit density estimates. These estimators are both efficient and
accurate, and are able to achieve state-of-the-art performance on real data sets.

1.3.5  Accelerated  Learning 

One major disadvantage of the algorithms that learn from collective data is their high computational
cost compared to those that operate on vectors. For example, in computer vision, by discretizing the
local features and aggregating them by the “bag of visual words” method [50], a typical 256 x 256
image can be characterized by one 1,000-dimensional vector, amounting to just 1KB of data. On
the other hand, to represent an image as a group of SIFT features, we typically need more than
1,500 128-dimensional vectors, amounting to about 190KB of data. The further computation
needed to process such groups is likely to be orders of magnitude larger as well. In order to make
the learning algorithms in this work truly useful, this hurdle of computational efficiency must be
overcome.

We explore different ways to improve the speed of the group similarity based methods. In most
cases, the cost to train, store, and apply the model is determined by the sizes/cardinalities of the
groups. Therefore, one approach is to directly attack the crux of the problem by reducing the size
of groups while maintaining the learning performance, in an unsupervised way. We call such an
operation condensing. We analyze and evaluate several possible ways to decrease the size of a
group, and discover that distribution approximation via K-means can successfully achieve the goal
of condensing.
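
The idea of condensing can be sketched as follows: a group of N points is replaced by k cluster centers, each carrying the fraction of points it represents. This is only an illustrative sketch using plain Lloyd iterations; the function name `condense`, the use of weights, and the defaults are assumptions, not the exact procedure evaluated in this thesis.

```python
import numpy as np

def condense(points, k, n_iter=20, seed=0):
    """Approximate a group by k weighted centers using Lloyd's K-means.

    Returns (centers, weights): centers has shape (k, d) and weights
    holds the fraction of the original points assigned to each center.
    """
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct points of the group.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Move each non-empty center to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    weights = np.bincount(labels, minlength=k) / len(points)
    return centers, weights

group = np.random.default_rng(1).standard_normal((200, 4))
centers, weights = condense(group, k=10)
```

Because each center is the mean of the points it represents, the weighted average of the centers reproduces the mean of the original group, while any pairwise computation between groups now involves k rather than N points.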

Another way to improve the speed of similarity-based algorithms is to reduce the number of
similarity evaluations. In SVM or spectral clustering, we need a full kernel matrix or distance
matrix, which means that similarities are needed for every pair of groups. In many problems,
structures exist in the similarity/divergence matrices that allow us to infer the full matrix based
on only part of the entries. By exploiting such structures, we can compare only some pairs of
groups and “complete” the similarities between the other pairs. In this work, we study methods
to complete both the kernel matrices and the divergence matrices, and compare their empirical
performance.
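
One well-known structure that permits such completion is low rank. The sketch below uses the Nyström approximation, which is chosen here purely as an illustration and is not necessarily one of the completion methods studied in this thesis: only the similarities between all groups and a few landmark groups are evaluated, and the remaining entries are inferred. The function name and the tolerance are hypothetical.

```python
import numpy as np

def nystrom_complete(K_cols, landmark_idx, tol=1e-10):
    """Infer a full n x n kernel matrix from its columns at the landmarks.

    K_cols: (n, m) kernel values between all n groups and m landmark groups.
    landmark_idx: indices of the m landmark groups among the n groups.
    """
    W = K_cols[landmark_idx, :]              # (m, m) landmark block
    W = 0.5 * (W + W.T)                      # symmetrize against round-off
    evals, evecs = np.linalg.eigh(W)
    keep = evals > tol * evals.max()         # drop numerically zero directions
    W_pinv = (evecs[:, keep] / evals[keep]) @ evecs[:, keep].T
    return K_cols @ W_pinv @ K_cols.T

# Hypothetical example: a rank-3 kernel matrix recovered from 5 of its 20 columns.
rng = np.random.default_rng(0)
features = rng.standard_normal((20, 3))
K_full = features @ features.T
landmarks = np.arange(5)
K_hat = nystrom_complete(K_full[:, landmarks], landmarks)
```

Here 100 kernel evaluations stand in for the 400 entries of the full matrix. The recovery is exact only because the example matrix is exactly low rank; for real similarity matrices the result is an approximation.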

1.3.6  Sampling  Bias 

One factor that can significantly affect the effectiveness of learning is sampling bias. In realistic
situations, sampling bias alters the way we collect points from the underlying distribution, and
makes the observed sample not representative of the true distribution. In other words, even though
the group $G_m$ has an underlying distribution $f_m$, the actual points in $G_m$ may not be faithful
samples from $f_m$, but rather drawn from some distorted version of $f_m$. Sampling bias therefore undermines the
fundamental validity of learning algorithms, including all the methods we propose in this work.
Though extensively studied in statistics, this key problem has been largely ignored by
previous research on learning from collective data.

We propose conditional divergences to correct these distortions and learn from biased groups
effectively. Traditional divergences mainly compare the joint distributions of the random variables.
Conditional divergences, on the other hand, focus on the conditional distributions of some variables
given the rest, and are insensitive to the distribution of the variables that we are conditioning on. As
long as the conditional distributions are intact, the conditional divergences will be accurate. An
efficient estimator is also developed for the conditional divergences.


1.4  Related  Fields 


Several other research fields are closely related to learning from collective data. Statistical
relational learning (SRL) [60] enhances point-centered machine learning by considering a group of points
and their relationships together. For example, SRL studies the collective classification problem,
where the goal is to simultaneously classify several objects based on their attributes and relations.
This problem is also studied under the name of structural prediction, typically using Markov
networks [158] or large-margin approaches [161]. Even though collective behaviors are considered,
SRL still tries to learn the labels of the points, whereas our research focuses only on the groups.

Multiple instance learning (MIL) [184] also tries to classify groups of points. In MIL, a group
is positive if at least one of its points is positive; otherwise it is negative. Consequently, in MIL the
nature of a group is determined by a few of its points. By comparison, we assume that it is the
holistic behavior of all the points that characterizes a group. Nevertheless, the methods
for learning from collective data sometimes overlap with methods for MIL. For example, some kernels
[12, 55] can be used for MIL as well as for learning from collective data.

Another related field is graph kernels, which study the similarities between graphs
defined by a set of nodes and edges. Graphs can also be considered as collective data.
Unlike the groups considered above, however, graph data are structured and the elements in a graph can no longer be considered
i.i.d. or exchangeable. Graph kernels often count the intersection of sub-structures between graphs.
For example, the random walk kernel [54, 165] measures the path similarity of random walks in
different graphs. [6] designed kernels between groups based on graph kernels.

Finally, quantization/discretization/encoding has always been the traditional way of learning
from collective data. These methods first turn vectors into discrete points, and then reduce the
groups into vectors. Once such a conversion is done, our algorithms for learning from discrete data
can be applied. The main focus of this thesis, however, is to avoid such conversions and learn high-quality
results from collective data directly.
1.5 Challenges in Scientific Data

This research is motivated by the need to analyze scientific data sets. The proposed algorithms
are intended to help scientists learn about the data sets they observe. Here we describe two
projects that involve astronomical surveys and particle/fluid simulation systems. In these projects,
we want to conduct exploratory studies to obtain the overall data profile, and pick out potentially
interesting things using novelty detection.

Astronomical  Surveys 

Astronomical surveys provide a holistic view of the universe by imaging a large portion of the
sky. The Sloan Digital Sky Survey⁴ (SDSS) project has imaged more than 35% of the sky and
gives millions of observations for stars, galaxies, quasars, and other celestial objects. Astronomers
are also planning even more powerful survey telescopes such as the Large Synoptic Survey
Telescope⁵ (LSST) that can scan the sky deeper and faster. The massive amount of data produced
by these surveys calls for the assistance of computational methods.

We  focus  on  the  spectroscopic  observations  in  the  SDSS  data  set.  SDSS  provides  for  each 
object  a  3700-dimensional  spectrum  as  shown  in  Figure  1.1b,  along  with  its  3D  spatial  location  for 
distant  objects  like  galaxies  and  quasars.  We  directly  take  these  spectra  as  the  feature  vectors  for 
these  objects. 


Figure 1.1: A galaxy observation from SDSS (RA=233.36609, DEC=56.55968, MJD=53437, Plate=614, Fiber=571). (a) Photometric.

Two novelty detection tasks are currently considered on this data set. The first one is to find
individually anomalous objects such as planetary nebulae. This task is traditional in that we want
to find unusual vectorial points, yet it still poses unique challenges. For example, the spectra
are high-dimensional, which makes it difficult to use density-based methods, and they usually
contain emission lines (see the spikes in Figure 1.1b) that could easily distort models. We focus on
subspace outliers, assuming that normal spectra can be reconstructed by a few bases, and develop
robust factorization methods to address the emission lines.

⁴ http://www.sdss.org
⁵ http://www.lsst.org

The  second  task  is  to  detect  special  clusters  of  galaxies.  Based  on  the  3D  spatial  locations, 
we  can  find  nearby  galaxies  and  put  them  into  clusters/groups.  These  clusters  could  shed  light  on 
the  development  of  the  universe  [166],  and  it  will  be  valuable  to  find  interesting  clusters  for  the 
astronomers.  In  this  case,  each  cluster  contains  a  set  of  spectrum  vectors,  and  we  shall  address  this 
problem  using  group  anomaly  detection  methods. 


Particle/Fluid  Simulation 


Figure  1.2:  Local  motion  patterns  in  the  JTDC  simulation.  Several  2D  slices  are  presented. 
In physics, researchers often simulate particle or fluid systems at a very large scale. For
example, the JHU Turbulence Database Clusters⁶ (JTDC) provides open access to $1024^4$ space-time
points in fluid simulations. At each point, the 3D velocity as well as other information including
pressure and temperature are recorded. See Figure 1.2 for some examples. Our task is again to
detect interesting phenomena in such massive data sets.

In these systems, a single particle is seldom interesting, but a group of particles can form
interesting phenomena like the vortices in Figure 1.2c. This can again be framed as a group
anomaly detection problem. We treat points in a local region as a group, and aim at finding
interesting collective motion patterns characterized by the distribution of locations, velocities, and
other relevant features. In other scenarios, the researchers could give some examples of interesting
phenomena, and then our supervised methods will classify the local regions accordingly.


1.6  Thesis  Overview 


Motivated by numerous practical problems, we propose various algorithms and models that can
learn from collective data effectively, and present our answers to the question of how and what we
can learn from collective data. We study different data types (discrete and continuous), as well as
different learning approaches (generative and discriminative). Efforts are also made to improve the
practicality of the proposed methods from multiple angles.

⁶ http://turbulence.pha.jhu.edu

The rest of this thesis is organized as follows. In Chapters 2 and 3 we describe two algorithms
to learn from discrete collective data, focusing on modeling the temporal effects and the outliers.
Chapters 4 and 5 describe the generative and discriminative methods of learning from continuous
multi-dimensional data. Chapter 6 further studies how to construct Mercer kernels for collective
data. Then we describe how to accelerate the learning while maintaining accuracy in Chapter 7.
Chapter 8 proposes conditional divergences to correct the sampling biases in collective data.
Finally, in Chapter 9 we summarize the thesis and discuss future directions for this research.
Part I

Learning  from  Discrete  Data 
Chapter 2

Modeling  Temporal  Effects  by  Tensor 
Factorizations 


In this chapter and the next, we learn from groups that contain discrete points and can be
converted into vectors. Specifically in this chapter, we discuss the recommendation, a.k.a.
collaborative filtering, problem, in which a user rates a set of items, and the goal is to find out which
other items this user might like. The proposed factorization algorithms can also be applied to
other problems of a similar nature.

We consider the temporal dynamics in collaborative filtering problems. Real-world data are
seldom stationary, yet traditional collaborative filtering algorithms generally rely on this
assumption. Motivated by our sales prediction problem, we propose a factor-based algorithm that is able
to take time into account. By introducing additional factors for time, we formalize this problem as
a tensor factorization with a special constraint on the time dimension. Further, we provide a fully
Bayesian treatment to avoid fine-tuning the parameters and achieve automatic model complexity
control. To learn this model we develop an efficient sampling procedure that is capable of
analyzing large-scale data sets. This new algorithm, called Bayesian Probabilistic Tensor Factorization
(BPTF), is evaluated on several real-world problems including sales prediction, movie
recommendation, and music recommendation. Empirical results demonstrate the superiority of the temporal
model.


2.1  Introduction 

Nowadays, recommendation, a.k.a. collaborative filtering, algorithms play a vital role in various
automatic recommendation systems and have been used in many online applications such as
Amazon.com, eBay, and Netflix. The setup of the problem is as follows. Different users rate different
items based on their preferences. Suppose that we have observed for each user the set of items
he/she has rated in the past. Then, collaborative filtering tries to predict how a user would rate
a currently unrated item. Note that here rating can mean either the actual preference such as movie
ratings, or other indications such as the quantity of the consumed item.

Successful as they are, one limitation of most existing methods is that they are static models in
which the statistical properties of data are assumed to be the same at different times. However, real
data often evolve over time and exhibit strong temporal patterns. To motivate our research,
let  us  consider  the  following  problem.  A  shoe  production  company  sells  many  types  of  shoes  to  its 
retailers  around  the  world.  Now  this  company  wants  to  predict  the  demand  of  different  shoes  by 
different  retailers  for  the  ongoing  season  based  on  this  season’s  initial  orders  and  historical  sales 
data.  Having  this  prediction,  the  company  can  make  more  informed  decisions  on  the  marketing 
strategy  and  inventory  planning.  Obviously,  the  behavior  of  the  market  and  the  retailers  is  changing 
over  time. 

The traditional way to solve this problem is to use statistical regression models or time-series
forecasting techniques for each shoe-retailer pair. Ideally, regression models can predict the order
using the features of the retailers and the shoes. But the reality is that few retailer and product
attributes are available due to the complexity of the domain knowledge and policy issues. What we
have is only the transaction data recording the retailer, product, and quantity of each order.
Therefore, it is more convenient to treat the products as discrete symbols, and represent the retailers by
the sets of products they ordered. On the other hand, typical time-series models such as
autoregressive moving average (ARMA) and exponential smoothing [20] use past data to make predictions.
But they are not suitable for our problem either, because the data are extremely scarce and each
season many new products are introduced, for which no historical data exist. Moreover, neither of
these two paradigms can exploit the “collaboration” between entities, and hence both are expected to
perform poorly when the data are sparse. For these reasons, we use collaborative filtering to make
the prediction.

Even  if  collaborative  filtering  is  able  to  handle  our  data,  traditional  static  methods  are  incapable 
of  learning  the  shift  of  product  designs  and  customers’  preferences,  especially  considering  that  we 
are  facing  the  volatile  and  fast-moving  fashion  business.  The  preference  of  the  market  can  change 
from  season  to  season  and  even  within  each  season.  In  this  case,  trying  to  explain  all  the  data  with 
one  fixed  global  model  would  be  ineffective.  On  the  other  hand,  if  we  only  use  the  recent  data 
or  down-weigh  the  past,  a  lot  of  useful  information  would  be  lost,  making  the  already  very  sparse 
data  set  even  worse. 

To solve this problem, we propose a factorization based method that is able to model
time-evolving data. This method is based on probabilistic latent factor models [141, 142]. In addition
to the factors that are used to characterize retailers and shoes, we introduce another set of latent
features for each different time period. Intuitively, these additional factors represent the
population-level preference of different (latent) features of the shoes at each particular time, so that they are
able to capture concepts like “high-heeled shoes lost their popularity this fall” or “orders of golf
shoes tend to arrive late”. A special constraint is imposed on the time factors to ensure that the
evolution of factors is smooth. This model learns the features of the entities using all the available
data, while adapting these features to different time periods. It can be formulated as a probabilistic
tensor factorization problem, and is thus widely applicable to other similar data sets.

The modeling of temporal effects is also useful in other collaborative filtering problems, since
the preferences of users are often subject to change. Notably, the remarkable progress in the
Netflix Prize contest is attributed to a temporal model [85]. The winner identifies strong temporal
patterns in the data, and exploits them to achieve a significant improvement leading to the best
performance attained by a single algorithm.

One outstanding problem for many data sets is that they are often very sparse. A retailer usually
orders only a small subset of shoes from the whole product line. In the Netflix Prize¹ data set, there
are 17,770 movies and 480,189 users, but only 99,072,112 training ratings. This means that on
average each user has only rated 1.16% of the movies. This phenomenon presents two challenges
for us. The first one is how to avoid over-fitting, and the second is how to take advantage of this
sparsity to accelerate computation. To address the first problem, we extend our approach using
Bayesian techniques. By introducing priors on the parameters, we can effectively average over
various models and ease the pain of tuning parameters. We call the resulting algorithm Bayesian
Probabilistic Tensor Factorization (BPTF). For scalability, we develop an efficient Markov
Chain Monte Carlo (MCMC) procedure for the learning process so that this algorithm can be
scaled to problems like Netflix.
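
The sparsity figure quoted above follows directly from the three counts. The short check below simply reproduces that arithmetic; the variable names are ours.

```python
# Counts quoted in the text for the Netflix Prize training set.
n_movies = 17_770
n_users = 480_189
n_ratings = 99_072_112

# Fraction of the user-by-movie matrix that is observed.
density = n_ratings / (n_movies * n_users)
# Average number of movies rated by one user.
ratings_per_user = n_ratings / n_users

print(f"{100 * density:.2f}% of entries observed")
print(f"{ratings_per_user:.1f} ratings per user on average")
```

In other words, a typical user has rated only a couple of hundred of the 17,770 movies, which is why an algorithm whose cost scales with the number of observed entries rather than with the full matrix size is essential.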

In  our  experiments  we  applied  our  BPTF  model  to  the  sales  prediction  problem  as  well  as 
movie  and  music  recommendation  problems.  The  empirical  results  show  that  using  the  temporal 
modeling,  consistent  improvement  of  prediction  accuracy  can  be  achieved  over  static  methods  at 
the  cost  of  little  extra  complexity  and  computation. 

The rest of this chapter is organized as follows. First we introduce some preliminaries about
factorization methods in Section 2.2. In Section 2.3 we describe the proposed model, which is
enhanced in Section 2.4 by Bayesian techniques. Some related work is discussed in Section 2.5.
Section 2.6 presents the empirical performance and efficiency of our method. Finally we make our
conclusions.


2.2  Preliminaries 

First we introduce some background and notations. Our data is stored in a matrix $X \in \mathbb{R}^{M \times D}$,
which can be considered as a rating matrix in collaborative filtering, or simply a data matrix formed
by stacking the vectors by row. Corresponding to the rows and columns of $X$, there are two types
of entities $\{u_i\}$ and $\{v_j\}$. In collaborative filtering, we call them “user” and “item” respectively.
They can also be “group” and “value” in the context of learning from discrete collective data.
Obviously, there are $M$ users and $D$ items.

The $(i, j)$th element of $X$, denoted as $x_{ij}$, is the “rating” that user $i$ gave to item $j$ (or it could
mean the number of times value $j$ appeared in group $i$). Note that in collaborative filtering a large
portion of the entries in $X$ are not observed.

Typical collaborative filtering algorithms can be categorized into two classes: neighborhood
methods and factorization methods. Generally, factor-based algorithms are considered more
effective than those based on neighborhood. But these two classes are often complementary and the best
performance is often obtained by blending them [9]. A practical survey of this field can be found
in [84].

One representative factor-based method for collaborative filtering is probabilistic matrix
factorization (PMF) [141]. PMF assigns a $Z$-dimensional latent feature vector to each user and
item, denoted as $u_i, v_j \in \mathbb{R}^Z$, and models each rating as the inner product of the corresponding latent
features, i.e. $x_{ij} \approx u_i^\top v_j$, where $u_i^\top$ is the transpose of $u_i$. Formally, the following conditional
distribution is assumed:

$$p(X \mid U, V, \alpha) = \prod_{i=1}^{M} \prod_{j=1}^{D} \left[ \mathcal{N}\!\left(x_{ij} \mid u_i^\top v_j, \alpha^{-1}\right) \right]^{I_{ij}}, \qquad (2.1)$$

where $\{u_i\}$, $\{v_j\}$ are the columns of $U \in \mathbb{R}^{Z \times M}$ and $V \in \mathbb{R}^{Z \times D}$, $\alpha$ is the observation precision, and
$I_{ij}$ is the indicator that $x_{ij}$ has been observed. Zero-mean Gaussian priors are imposed on $u_i$ and $v_j$
as a regularization.

¹ http://www.netflixprize.com/

This  model  can  be  learned  by  estimating  the  value  of  U  and  V  using  the  maximum  a  posteriori 
(MAP)  principle.  It  turns  out  that  this  learning  procedure  actually  corresponds  to  the  following 
weighted  regularized  matrix  factorization: 

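$$U^*, V^* = \arg\min_{U, V} \sum_{i=1}^{M} \sum_{j=1}^{D} I_{ij} \left(x_{ij} - u_i^\top v_j\right)^2 + \lambda_U \sum_{i=1}^{M} \|u_i\|^2 + \lambda_V \sum_{j=1}^{D} \|v_j\|^2. \qquad (2.2)$$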

These  formulations  reflect  the  basic  ideas  of  factorization  based  collaborative  filtering. 

The optimization problem (2.2) can be solved efficiently using gradient descent. This model
was very successful in the Netflix Prize contest in terms of speed-accuracy trade-off. The
drawback, though, is that it requires fine tuning of both the model and the training procedure to predict
accurately. This process is computationally expensive on large data sets.
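
The following is a minimal sketch of solving the regularized factorization (2.2) by full-batch gradient descent. It is meant only to show the shape of the computation: the function names, the step size, the number of iterations, and the regularization weights are illustrative assumptions, and practical PMF implementations typically use stochastic updates over the observed entries instead.

```python
import numpy as np

def pmf_objective(X, mask, U, V, lam_u, lam_v):
    """Value of the objective in (2.2); mask holds the indicators I_ij."""
    fit = np.sum(mask * (X - U.T @ V) ** 2)
    return fit + lam_u * np.sum(U ** 2) + lam_v * np.sum(V ** 2)

def pmf_map(X, mask, Z=2, lam_u=0.1, lam_v=0.1, lr=0.005, n_iter=2000, seed=0):
    """Estimate U (Z x M) and V (Z x D) by gradient descent on (2.2)."""
    rng = np.random.default_rng(seed)
    M, D = X.shape
    U = 0.1 * rng.standard_normal((Z, M))
    V = 0.1 * rng.standard_normal((Z, D))
    for _ in range(n_iter):
        # Residuals on the observed entries only (unobserved ones are zeroed).
        R = mask * (U.T @ V - X)
        grad_U = 2.0 * (V @ R.T) + 2.0 * lam_u * U
        grad_V = 2.0 * (U @ R) + 2.0 * lam_v * V
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V

# Hypothetical toy problem: a rank-2 rating matrix with about 60% of entries observed.
rng = np.random.default_rng(1)
X_true = rng.standard_normal((2, 20)).T @ rng.standard_normal((2, 15))
mask = (rng.random((20, 15)) < 0.6).astype(float)
U_hat, V_hat = pmf_map(X_true, mask)
```

Even in this toy setting the result depends on `lr`, `lam_u`, `lam_v`, and `Z`, which illustrates the tuning burden mentioned above.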

We present the proposed method in two parts. First we extend PMF to tensor factorization to
model temporal data, and formulate a maximum a posteriori (MAP) scheme for estimating the
factors. Then we apply a fully Bayesian treatment to deal with the tuning of the prior parameters and
derive an almost parameter-free probabilistic tensor factorization algorithm. Finally an efficient
learning procedure is developed.

2.3  A  Tensor  Model  for  Temporal  Data 

In PMF each rating is determined by the inner product of the user feature and the item feature. To
model their time-evolving behavior, we make use of the tensor notation. We can denote a rating
as $x_{ij}^k$, where $i, j$ index users and items as before, and $k$ indexes the time slice when the rating was
given. Then, similar to the static case, we can organize these ratings into a three-dimensional tensor
$X \in \mathbb{R}^{M \times D \times K}$ whose three dimensions correspond to user, item, and time respectively.

Based on the idea of PMF, we assume that each entry $x_{ij}^k$ can be expressed as the inner product
of three $Z$-dimensional vectors:

$$x_{ij}^k \approx \langle u_i, v_j, t_k \rangle \equiv \sum_{z=1}^{Z} u_{zi} v_{zj} t_{zk}, \qquad (2.3)$$

where $u_i$, $v_j$ are the factors for users and items, while $t_k$ is the additional latent feature
vector for the $k$th time slice. Using matrix representations $U = [u_1, \dots, u_M]$, $V = [v_1, \dots, v_D]$,
and $T = [t_1, \dots, t_K]$, we can also express Eq. (2.3) as a three-way tensor factorization of $X$:

$$X \approx \sum_{z=1}^{Z} U_{z,:} \circ V_{z,:} \circ T_{z,:}, \qquad (2.4)$$

where $U_{z,:}$, $V_{z,:}$ and $T_{z,:}$ represent the $z$th rows of $U$, $V$ and $T$, and $\circ$ denotes the vector outer
product. This is an instance of the CANDECOMP/PARAFAC (CP) decomposition [82], for which
an illustration is given in Figure 2.1. We prefer this model over the one that assigns a separate factor to
each entity at each time slice, because the latter would increase the number of factors dramatically and does
not implement a tensor factorization.
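
The equivalence between the entry-wise form (2.3) and the sum of outer products (2.4) can be checked numerically. The sketch below uses small, arbitrary dimensions and random factors chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Z, M, D, K = 3, 4, 5, 6          # latent dimension, users, items, time slices
U = rng.standard_normal((Z, M))  # column i is the user factor u_i
V = rng.standard_normal((Z, D))  # column j is the item factor v_j
T = rng.standard_normal((Z, K))  # column k is the time factor t_k

# Eq. (2.3): every entry is a three-way inner product over the latent index z.
X_entrywise = np.einsum("zi,zj,zk->ijk", U, V, T)

# Eq. (2.4): the same tensor as a sum of Z rank-one outer products of rows.
X_outer = sum(
    np.multiply.outer(np.multiply.outer(U[z], V[z]), T[z]) for z in range(Z)
)
```

The model uses Z(M + D + K) parameters, compared with Z(M + D)K if every user and item had a separate factor in each time slice.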

An interpretation of the factorization (2.3) is that a rating depends not only on how well a
user’s preferences and an item’s features match, but also on how much these features match with
the “current trend” reflected in the time factors. For instance, if a user likes green shoes but the
overall trend of this year is that few people wear them on the street, then this user is probably not
going to buy them either.

To account for the randomness in ratings, we consider the following probabilistic model:

$$x_{ij}^k \mid U, V, T \sim \mathcal{N}\!\left(\langle u_i, v_j, t_k \rangle, \alpha^{-1}\right), \qquad (2.5)$$

i.e. the conditional distribution of $x_{ij}^k$ given $U$, $V$, and $T$ is a Gaussian distribution with mean
$\langle u_i, v_j, t_k \rangle$ and precision $\alpha$. Note that if $t_k$ is an all-one vector then this model is equivalent to
PMF. Since many entries in $X$ are missing, estimation based on the model (2.5) may over-fit the
observed entries and fail to predict the missing entries well. To deal with this issue, we place prior
distributions on $U$, $V$, and $T$ to regularize. Specifically we impose zero-mean Gaussian priors on
user and item factors:


W  ~  M  (0,  <7^I)  ,  i  —  1, . . . ,  M,  (2.6) 

Vj  -  J\f  (0,  ay!)  ,  j  =  l,...,D,  (2.7) 

where  I  is  the  D  x  D  identity  matrix. 

As for the time factors, since they account for the evolution of global trends, a reasonable prior belief is that they change smoothly over time. We further assume that each time feature vector depends only on its immediate predecessor. Therefore, we can use the following conditional prior for the factors $T$:

$$\mathbf{t}_k \sim \mathcal{N}\left(\mathbf{t}_{k-1}, \sigma_{dT}^2 I\right), \quad k = 1, \ldots, K. \tag{2.8}$$

For the initial time feature vector $\mathbf{t}_0$, we assume

$$\mathbf{t}_0 \sim \mathcal{N}\left(\mu_T, \sigma_0^2 I\right), \tag{2.9}$$

where $\mu_T \in \mathbb{R}^{Z}$ is the prior mean. We call this model the Probabilistic Tensor Factorization (PTF).
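To make the generative reading of (2.5), (2.8) and (2.9) concrete, here is a minimal Python sketch that draws a chain of time factors as a Gaussian random walk and then draws one noisy rating. The function names, the seed, and all numeric values are illustrative assumptions, not settings from our experiments.

```python
import random
from typing import List


def sample_time_chain(K: int, mu_T: List[float], sigma_0: float,
                      sigma_dT: float, rng: random.Random) -> List[List[float]]:
    """Draw t_0 ~ N(mu_T, sigma_0^2 I), then t_k ~ N(t_{k-1}, sigma_dT^2 I).

    Returns the list [t_0, t_1, ..., t_K].
    """
    t = [rng.gauss(m, sigma_0) for m in mu_T]
    chain = [t]
    for _ in range(K):
        # Each slice depends only on its immediate predecessor (Markov prior).
        t = [rng.gauss(prev, sigma_dT) for prev in t]
        chain.append(t)
    return chain


def sample_rating(u: List[float], v: List[float], t: List[float],
                  alpha: float, rng: random.Random) -> float:
    """Draw x ~ N(<u, v, t>, 1/alpha); alpha is a precision, so sd = alpha**-0.5."""
    mean = sum(a * b * c for a, b, c in zip(u, v, t))
    return rng.gauss(mean, alpha ** -0.5)


rng = random.Random(0)  # fixed seed, for reproducibility only
chain = sample_time_chain(K=4, mu_T=[1.0, 1.0, 1.0], sigma_0=0.1,
                          sigma_dT=0.05, rng=rng)
```

A small `sigma_dT` keeps consecutive time factors close to each other, which is the smoothness assumption stated above.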

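Figure 2.1: CP decomposition of a three-way tensor $\mathcal{X}$

Having the observational model (2.5) and the priors, we can estimate the latent features $U$, $V$, and $T$ by maximizing the logarithm of the posterior distribution, which takes the following form assuming ratings are independent given the latent factors:

$$
\begin{aligned}
\log p(U, V, T \mid \mathcal{X}) &= \log p(\mathcal{X} \mid U, V, T) + \log p(U, V, T) + \text{const} \\
&= \sum_{k=1}^{K} \sum_{i=1}^{M} \sum_{j=1}^{D} I_{ij}^k \log p\left(x_{ij}^k \mid \mathbf{u}_i, \mathbf{v}_j, \mathbf{t}_k\right) + \sum_{i=1}^{M} \log p(\mathbf{u}_i) + \sum_{j=1}^{D} \log p(\mathbf{v}_j) \\
&\quad + \sum_{k=1}^{K} \log p(\mathbf{t}_k \mid \mathbf{t}_{k-1}) + \log p(\mathbf{t}_0) + \text{const} \\
&= -\sum_{k=1}^{K} \sum_{i=1}^{M} \sum_{j=1}^{D} \frac{I_{ij}^k \left(x_{ij}^k - \langle \mathbf{u}_i, \mathbf{v}_j, \mathbf{t}_k \rangle\right)^2}{2\alpha^{-1}} + \frac{\#nz}{2} \log \alpha \\
&\quad - \sum_{i=1}^{M} \frac{\|\mathbf{u}_i\|^2}{2\sigma_U^2} - MZ \log \sigma_U - \sum_{j=1}^{D} \frac{\|\mathbf{v}_j\|^2}{2\sigma_V^2} - DZ \log \sigma_V \\
&\quad - \sum_{k=1}^{K} \frac{\|\mathbf{t}_k - \mathbf{t}_{k-1}\|^2}{2\sigma_{dT}^2} - KZ \log \sigma_{dT} - \frac{\|\mathbf{t}_0 - \mu_T\|^2}{2\sigma_0^2} - Z \log \sigma_0 + C,
\end{aligned}
$$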

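where $I_{ij}^k$ indicates the presence of $x_{ij}^k$, $\#nz$ is the total number of ratings, and $C$ is a constant. Under fixed values of $\alpha$, $\sigma_U$, $\sigma_V$, $\sigma_{dT}$, $\sigma_0$ and $\mu_T$, which are usually referred to as hyper-parameters, maximizing the log-posterior with respect to $U$, $V$, $T$ is equivalent to minimizing the following regularized sum of squared errors:

$$
\sum_{k=1}^{K} \sum_{i=1}^{M} \sum_{j=1}^{D} I_{ij}^k \left(x_{ij}^k - \langle \mathbf{u}_i, \mathbf{v}_j, \mathbf{t}_k \rangle\right)^2 + \lambda_U \sum_{i=1}^{M} \|\mathbf{u}_i\|^2 + \lambda_V \sum_{j=1}^{D} \|\mathbf{v}_j\|^2 + \lambda_{dT} \sum_{k=1}^{K} \|\mathbf{t}_k - \mathbf{t}_{k-1}\|^2 + \lambda_0 \|\mathbf{t}_0 - \mu_T\|^2, \tag{2.10}
$$

where $\lambda_U = (\alpha \sigma_U^2)^{-1}$, $\lambda_V = (\alpha \sigma_V^2)^{-1}$, $\lambda_{dT} = (\alpha \sigma_{dT}^2)^{-1}$, and $\lambda_0 = (\alpha \sigma_0^2)^{-1}$.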

The objective function (2.10) is non-convex, and we may only be able to find a local minimum. To optimize it, common choices include stochastic gradient descent and block coordinate descent, both of which update the latent feature vectors iteratively. The estimates $U^*$, $V^*$, and $T^*$ can then be used to predict an unobserved rating through the distribution (2.5).

One issue with the aforementioned approach is the tuning of the hyper-parameters $\alpha$, $\sigma_U$, $\sigma_V$, $\sigma_{dT}$, $\sigma_0$ and $\mu_T$. Since there are many of them, the usual approaches to hyper-parameter selection, such as cross-validation, are infeasible even for a modest problem size. We thus propose in the next section a fully Bayesian treatment that integrates out the hyper-parameters in the model, leading to an almost parameter-free estimation procedure.
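For reference, the regularized objective (2.10) can be evaluated with a few lines of Python. This is only a sketch under stated assumptions: the function name `ptf_objective`, the dictionary format for observed ratings, and the convention that `T[k - 1]` stores $\mathbf{t}_k$ are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def sq_norm(a: Vector) -> float:
    return sum(x * x for x in a)


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ptf_objective(ratings: Dict[Tuple[int, int, int], float],
                  U: List[Vector], V: List[Vector], T: List[Vector],
                  t0: Vector, mu_T: Vector,
                  lam_U: float, lam_V: float,
                  lam_dT: float, lam_0: float) -> float:
    """Regularized sum of squared errors in the form of objective (2.10).

    ratings maps (i, j, k) -> observed rating, with 0-based i, j and
    1-based k, so that T[k - 1] holds t_k and t0 is the initial factor.
    """
    data = 0.0
    for (i, j, k), x in ratings.items():
        pred = sum(a * b * c for a, b, c in zip(U[i], V[j], T[k - 1]))
        data += (x - pred) ** 2
    chain = [t0] + T  # t_0, t_1, ..., t_K
    smooth = sum(sq_dist(chain[k], chain[k - 1]) for k in range(1, len(chain)))
    return (data
            + lam_U * sum(sq_norm(u) for u in U)
            + lam_V * sum(sq_norm(v) for v in V)
            + lam_dT * smooth
            + lam_0 * sq_dist(t0, mu_T))
```

Only observed entries appear in `ratings`, so the cost of one evaluation is proportional to the number of observed ratings rather than to the size of the full tensor.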


2.4  A  Bayesian  Treatment 

The performance of PTF depends on the careful tuning of the hyper-parameters when model parameters are estimated by maximizing the posterior probability, as pointed out in [142]. The point estimate obtained by MAP is often vulnerable to over-fitting when hyper-parameters are not properly tuned, and more so when the data are sparse.

An alternative scheme that may help alleviate over-fitting is Bayesian estimation, which integrates out all model parameters and arrives at a predictive distribution of future observations given the observed data. Because this predictive distribution is obtained by averaging over all models in the model space specified by the priors, it is less likely to over-fit a given set of observations.

However, when integrating over parameters one often cannot obtain an analytical solution, so we need to apply sampling-based approximation methods such as Markov Chain Monte Carlo (MCMC). For large-scale problems, sampling-based methods are usually not preferred due to their computational cost and convergence-related issues. Nevertheless, [142] devises an MCMC procedure for PMF that can run efficiently on large data sets like Netflix. The main trick is choosing proper distributions for the hyper-parameters so that sampling can be carried out efficiently.

Inspired  by  the  work  of  [142],  we  present  in  the  following  a  fully  Bayesian  treatment  to  the 
PTF  model  proposed  in  Section  2.3.  We  refer  to  the  resulting  method  as  BPTF  for  Bayesian 
Probabilistic  Tensor  Factorization. 


2.4.1  Model  Specification  for  BPTF 

A graphical overview of our entire model is given in Figure 2.2, and each component is described below. The model for generating ratings is the same as Eq. (2.5):

$$x_{ij}^k \mid U, V, T \sim \mathcal{N}\left(\langle \mathbf{u}_i, \mathbf{v}_j, \mathbf{t}_k \rangle,\ \alpha^{-1}\right). \tag{2.11}$$

As before, the prior distributions for the user and the item feature vectors are assumed to be Gaussian, but the mean and the precision matrix (inverse of the covariance matrix) may take arbitrary values:

$$\mathbf{u}_i \sim \mathcal{N}\left(\mu_U, \Lambda_U^{-1}\right), \quad i = 1, \ldots, M,$$

$$\mathbf{v}_j \sim \mathcal{N}\left(\mu_V, \Lambda_V^{-1}\right), \quad j = 1, \ldots, D.$$

Figure 2.2: The graphical model for BPTF

For the time factors, we make the same Markovian assumption as in Section 2.3 and consider the priors:

$$\mathbf{t}_1 \sim \mathcal{N}\left(\mu_T, \Lambda_T^{-1}\right),$$

$$\mathbf{t}_k \sim \mathcal{N}\left(\mathbf{t}_{k-1}, \Lambda_T^{-1}\right), \quad k = 2, \ldots, K.$$

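The key ingredient of our fully Bayesian treatment is to view the hyper-parameters $\alpha$, $\Theta_U = \{\mu_U, \Lambda_U\}$, $\Theta_V = \{\mu_V, \Lambda_V\}$, and $\Theta_T = \{\mu_T, \Lambda_T\}$ also as random variables, leading to a predictive distribution for an unobserved rating $\hat{x}_{ij}^k$,

$$p\left(\hat{x}_{ij}^k \mid \mathcal{X}\right) = \int p\left(\hat{x}_{ij}^k \mid \mathbf{u}_i, \mathbf{v}_j, \mathbf{t}_k, \alpha\right) p\left(U, V, T, \alpha, \Theta_U, \Theta_V, \Theta_T \mid \mathcal{X}\right) d\{U, V, T, \alpha, \Theta_U, \Theta_V, \Theta_T\}, \tag{2.12}$$

that integrates over the parameters, as opposed to the scheme in Section 2.3 which simply plugs the MAP estimates into Eq. (2.5).

We then need to choose prior distributions for the hyper-parameters (the so-called hyper-priors). For the Gaussian parameters, we choose conjugate priors, which simplify subsequent computations:

$$
\begin{aligned}
p(\alpha) &= \mathcal{W}\left(\alpha \mid \tilde{W}_0, \tilde{\nu}_0\right), \\
p(\Theta_U) &= p(\mu_U \mid \Lambda_U)\, p(\Lambda_U) = \mathcal{N}\left(\mu_0, (\beta_0 \Lambda_U)^{-1}\right) \mathcal{W}\left(\Lambda_U \mid W_0, \nu_0\right), \\
p(\Theta_V) &= p(\mu_V \mid \Lambda_V)\, p(\Lambda_V) = \mathcal{N}\left(\mu_0, (\beta_0 \Lambda_V)^{-1}\right) \mathcal{W}\left(\Lambda_V \mid W_0, \nu_0\right), \\
p(\Theta_T) &= p(\mu_T \mid \Lambda_T)\, p(\Lambda_T) = \mathcal{N}\left(\rho_0, (\beta_0 \Lambda_T)^{-1}\right) \mathcal{W}\left(\Lambda_T \mid W_0, \nu_0\right).
\end{aligned}
$$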


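Here $\mathcal{W}$ is the Wishart distribution over a $Z \times Z$ random matrix $\Lambda$ with $\nu_0$ degrees of freedom and a $Z \times Z$ scale matrix $W_0$:

$$\mathcal{W}(\Lambda \mid W_0, \nu_0) = \frac{|\Lambda|^{(\nu_0 - Z - 1)/2}}{B} \exp\left(-\frac{\mathrm{Tr}\left(W_0^{-1} \Lambda\right)}{2}\right), \tag{2.13}$$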


where $B$ is the normalizing constant. There are several parameters in the hyper-priors: $\mu_0$, $\rho_0$, $\beta_0$, $W_0$, $\nu_0$, $\tilde{W}_0$, and $\tilde{\nu}_0$. These parameters should reflect our prior knowledge about the specific problem and are treated as constants during training. Nevertheless, slightly varying their values usually has little impact on the final prediction performance, as is often observed in Bayesian learning. Note that we use the same hyper-parameters for all the factors for convenience, while in fact different priors can be used for different factors if appropriate or necessary. In our experiment we set $\mu_0 = 0$, $\beta_0 = 1$, $W_0 = I$, $\nu_0 = D$, $\tilde{W}_0 = 1$, $\tilde{\nu}_0 = 1$. For notational convenience, we aggregate the parameters as $\Theta_0 = \{\mu_0, \beta_0, W_0, \nu_0, \tilde{W}_0, \tilde{\nu}_0\}$ and $\Theta = \{\Theta_U, \Theta_V, \Theta_T, \Theta_0\}$.
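To give a feel for the Wishart hyper-prior, the sketch below draws a precision matrix for the special case of an identity scale matrix and an integer number of degrees of freedom, using the standard fact that a sum of $\nu$ outer products of independent standard normal vectors is Wishart distributed with scale $I$ and $\nu$ degrees of freedom. The function name and the restriction to this special case are assumptions made to keep the example short; it is not the sampler used in our implementation.

```python
import random
from typing import List


def sample_wishart_identity(Z: int, nu: int, rng: random.Random) -> List[List[float]]:
    """Draw Lambda ~ W(I, nu) for integer nu >= Z.

    Construction: Lambda = sum_{n=1}^{nu} g_n g_n^T with g_n ~ N(0, I_Z).
    """
    if nu < Z:
        raise ValueError("need nu >= Z for an invertible draw")
    lam = [[0.0] * Z for _ in range(Z)]
    for _ in range(nu):
        g = [rng.gauss(0.0, 1.0) for _ in range(Z)]
        for a in range(Z):
            for b in range(Z):
                lam[a][b] += g[a] * g[b]
    return lam


Lambda = sample_wishart_identity(Z=3, nu=5, rng=random.Random(0))
```

The expected value of such a draw is $\nu I$, so larger $\nu$ corresponds to a prior belief in larger precisions, i.e. tighter factor distributions.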


2.4.2  Learning  by  Markov  Chain  Monte  Carlo 

The predictive distribution (2.12) involves a multi-dimensional integral that cannot be computed analytically. We thus resort to approximation techniques. The main idea is to view Eq. (2.12) as an expectation of $p\left(\hat{x}_{ij}^k \mid \mathbf{u}_i, \mathbf{v}_j, \mathbf{t}_k, \alpha\right)$ over the posterior distribution $p\left(U, V, T, \alpha, \Theta_U, \Theta_V, \Theta_T \mid \mathcal{X}\right)$, and to approximate the expectation by an average over samples drawn from the posterior distribution. Since the posterior is too complex to sample from directly, we apply a widely used indirect sampling technique, Markov Chain Monte Carlo (MCMC) [66, 116, 117]. The method works by drawing a sequence of samples from some proposal distribution such that each sample depends only on the previous one, thus forming a Markov chain. When the sampling step obeys certain properties, most notably detailed balance, the chain converges to the desired distribution. We then collect a number of samples and approximate the integral in Eq. (2.12) by

$$p\left(\hat{x}_{ij}^k \mid \mathcal{X}\right) \approx \frac{1}{L} \sum_{l=1}^{L} p\left(\hat{x}_{ij}^k \mid \mathbf{u}_i^{(l)}, \mathbf{v}_j^{(l)}, \mathbf{t}_k^{(l)}, \alpha^{(l)}\right), \tag{2.14}$$

where $L$ denotes the number of samples collected and $\mathbf{u}_i^{(l)}$, $\mathbf{v}_j^{(l)}$, $\mathbf{t}_k^{(l)}$, and $\alpha^{(l)}$ are the $l$th samples. A detailed description of MCMC can be found in [139].
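The Monte Carlo average in Eq. (2.14) can be sketched in a few lines of Python. Because each term is a Gaussian centred at the three-way inner product, the predictive mean is simply the average of those inner products over the samples. The dictionary layout of a sample (keys `"U"`, `"V"`, `"T"`, `"alpha"`) and the function names are hypothetical.

```python
import math
from typing import Dict, List


def trilinear(u: List[float], v: List[float], t: List[float]) -> float:
    return sum(a * b * c for a, b, c in zip(u, v, t))


def predictive_mean(samples: List[Dict], i: int, j: int, k: int) -> float:
    """Mean of the mixture in Eq. (2.14): the average of <u_i, v_j, t_k> over samples."""
    total = sum(trilinear(s["U"][i], s["V"][j], s["T"][k]) for s in samples)
    return total / len(samples)


def predictive_density(x: float, samples: List[Dict], i: int, j: int, k: int) -> float:
    """Eq. (2.14) evaluated at x: an equally weighted mixture of Gaussians."""
    total = 0.0
    for s in samples:
        mean = trilinear(s["U"][i], s["V"][j], s["T"][k])
        alpha = s["alpha"]  # precision of the observation noise
        total += math.sqrt(alpha / (2.0 * math.pi)) * math.exp(-0.5 * alpha * (x - mean) ** 2)
    return total / len(samples)
```

In practice only the running sum of per-sample predictions needs to be stored, so the samples themselves can be discarded as soon as they have been used.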

There are quite a few different flavors of MCMC. Here we use the Gibbs sampling paradigm [56]. In Gibbs sampling, the target random variables are decomposed into several disjoint subsets or blocks, and at each iteration one block of random variables is sampled while all the others are held fixed. The blocks are sampled in turn until convergence. Such a scheme is very similar to the nonlinear Gauss-Seidel method (Chapter 2.7, [10]) for nonlinear optimization, which optimizes iteratively over blocks of variables.
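The block-wise idea is easiest to see on a toy target. The sketch below runs a Gibbs sampler on a standard bivariate Gaussian with correlation `rho`, where each coordinate is one block and the conditionals are known in closed form: $x \mid y \sim \mathcal{N}(\rho y, 1 - \rho^2)$ and symmetrically for $y \mid x$. This toy model, the function names, and the chosen settings are illustrative assumptions and are unrelated to the BPTF conditionals.

```python
import math
import random
from typing import List, Tuple


def gibbs_bivariate_normal(rho: float, n_samples: int, burn_in: int,
                           rng: random.Random) -> List[Tuple[float, float]]:
    """Gibbs sampling for a standard bivariate normal with correlation rho."""
    x, y = 0.0, 0.0
    cond_sd = math.sqrt(1.0 - rho * rho)
    draws = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # block 1: sample x with y held fixed
        y = rng.gauss(rho * x, cond_sd)  # block 2: sample y with x held fixed
        if step >= burn_in:
            draws.append((x, y))
    return draws


def correlation(pairs: List[Tuple[float, float]]) -> float:
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)


draws = gibbs_bivariate_normal(rho=0.5, n_samples=20000, burn_in=500,
                               rng=random.Random(0))
```

After the burn-in period, the empirical correlation of `draws` should be close to `rho`, which is a simple sanity check that the chain has reached the target distribution.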

As indicated by its parametrization, our target distribution $p\left(U, V, T, \alpha, \Theta_U, \Theta_V, \Theta_T \mid \mathcal{X}\right)$ has an inherent block structure of the random variables. In the appendix of this chapter we show that such a block structure, together with our choice of model components in Section 2.4.1, gives rise to conditional distributions that are easy to sample from, leading to an efficient Gibbs sampling procedure as outlined in Algorithm 2. It has two notable features: 1) the only distributions that need to be sampled are multivariate Gaussian distributions and the Wishart distribution; 2) individual user feature vectors can be sampled in parallel, and so can individual item vectors.


Algorithm 2 Gibbs sampling for BPTF

Initialize model parameters $\{U^{(1)}, V^{(1)}, T^{(1)}\}$.

For $l = 1, \ldots, L$,

• Sample the hyper-parameters $\alpha^{(l)}$, $\Theta_U^{(l)}$, $\Theta_V^{(l)}$, $\Theta_T^{(l)}$ according to (2.16), (2.17), (2.18) and (2.19), respectively.

• For $i = 1, \ldots, M$, sample the user factors $\{\mathbf{u}_i^{(l+1)}\}$ (in parallel) according to (2.20).

• For $j = 1, \ldots, D$, sample the item factors $\{\mathbf{v}_j^{(l+1)}\}$ (in parallel) according to (2.21).

• For $k = 1, \ldots, K$, sample the time factors $\{\mathbf{t}_k^{(l+1)}\}$ according to (2.22).
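The control flow of Algorithm 2 can be sketched as the following Python skeleton. The four sampler callables are hypothetical placeholders for the conditional draws (2.16) to (2.22), which are derived in the appendix and are not reproduced here; only the order of the block updates is illustrated, and the dictionary-based state layout is likewise an assumption.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def gibbs_sweeps(state: State, n_samples: int,
                 sample_hyper: Callable[[State], Any],
                 sample_user: Callable[[State, int], List[float]],
                 sample_item: Callable[[State, int], List[float]],
                 sample_time: Callable[[State, int], List[float]]) -> List[State]:
    """Run n_samples block-wise sweeps in the order given by Algorithm 2."""
    draws = []
    for _ in range(n_samples):
        state["hyper"] = sample_hyper(state)
        # User factors are conditionally independent of each other, so each
        # draw below only reads the old state and could run in parallel.
        state["U"] = [sample_user(state, i) for i in range(len(state["U"]))]
        state["V"] = [sample_item(state, j) for j in range(len(state["V"]))]
        # Under the Markov prior each t_k is tied to its neighbouring slices,
        # so the time factors are updated one at a time, in place.
        for k in range(len(state["T"])):
            state["T"][k] = sample_time(state, k)
        draws.append({"hyper": state["hyper"], "U": list(state["U"]),
                      "V": list(state["V"]), "T": list(state["T"])})
    return draws
```

Passing the samplers in as arguments keeps the loop independent of the particular conditional distributions, which makes the skeleton easy to test with stub functions.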


2.4.3  Scalability  and  Practical  Issues 

In our implementation, the PTF model is optimized using alternating least squares, a block coordinate descent algorithm that optimizes one user or one item at a time. The BPTF model is learned using Gibbs sampling as described in Algorithm 2. Both of them are efficient and scalable for large data sets.

Let $\#nz$ be the number of observed ratings in the training data. For each iteration, the time complexity for both PTF and BPTF is $O(\#nz \times Z^2 + (M + D + K) \times Z^3)$. Typically, the term $\#nz \times Z^2$ is much larger than the others, so the complexity grows linearly with respect to the number of observations. For the choice of $Z$, in our experience using tens of latent features usually achieves a good balance between speed and accuracy. Inevitably, BPTF is slower than the non-Bayesian PMF, which has a complexity of $O(\#nz \times Z)$ per iteration using stochastic gradient descent. But using PMF involves a model selection problem: typically the parameters $\lambda_U$ and $\lambda_V$ have to be tuned along with the early-stopping strategy. This process can be prohibitive for large data sets. BPTF, on the other hand, removes the need to tune these hyper-parameters by placing priors on them. We can therefore set the priors according to our knowledge and let the algorithm adapt them to the data. Empirically, good results can be obtained without any tuning.
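As a rough worked example of the two terms in the complexity bound, the following Python snippet plugs in the Netflix figures reported in Section 2.6.2 (total number of ratings, users, movies, and time slices) with $Z = 30$. Constant factors are ignored and the probe set is not subtracted, so the numbers only indicate relative magnitude; the function name is hypothetical.

```python
from typing import Tuple


def per_iteration_terms(nnz: int, M: int, D: int, K: int, Z: int) -> Tuple[int, int]:
    """Return the two terms of O(#nz * Z^2 + (M + D + K) * Z^3), without constants."""
    return nnz * Z ** 2, (M + D + K) * Z ** 3


rating_term, factor_term = per_iteration_terms(
    nnz=100_480_507, M=480_189, D=17_770, K=27, Z=30)
ratio = rating_term / factor_term
```

With these inputs the rating term is about 9.0e10 and the factor term about 1.3e10, so the rating term dominates by a factor of roughly 6.7.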

When using MCMC, a typical issue is the convergence of sampling. Theoretically, the results generated are only accurate when the chain has reached its equilibrium. This however would usually take a long time, and there is no effective way to diagnose the convergence. To alleviate this, we use the MAP result from PMF to initialize the sampling. In our experience the chain then usually converges within a few hundred samples. Moreover, we found that the accuracy increases monotonically as the number of samples increases. Therefore, in practice we can just monitor the performance on validation sets and stop sampling when the improvement from additional samples becomes negligible.
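One simple way to implement such a stopping rule is sketched below in Python: predictions from successive samples are averaged, the validation RMSE of the running average is tracked, and sampling stops once the improvement falls below a tolerance. The function names, the tolerance `tol`, and the exact form of the rule are assumptions for illustration rather than the procedure used in our experiments.

```python
import math
from typing import Iterable, List


def rmse(pred: List[float], truth: List[float]) -> float:
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def average_until_flat(sample_preds: Iterable[List[float]],
                       truth: List[float], tol: float) -> List[float]:
    """Return the validation RMSE after each sample, stopping early.

    sample_preds yields, for each posterior sample, its predictions on the
    validation ratings; the running mean of these is what gets scored.
    """
    running_sum = [0.0] * len(truth)
    history: List[float] = []
    for n, preds in enumerate(sample_preds, start=1):
        running_sum = [s + p for s, p in zip(running_sum, preds)]
        history.append(rmse([s / n for s in running_sum], truth))
        # Stop once an extra sample improves the RMSE by less than tol.
        if n > 1 and history[-2] - history[-1] < tol:
            break
    return history
```

Because `sample_preds` can be a generator, samples are only drawn for as long as the rule requires.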


2.5  Related  Work 

There is a large body of work on factorization methods for collaborative filtering. The best known is Singular Value Decomposition (SVD), which is also called Latent Semantic Analysis (LSA) in the language and information retrieval communities. Building on LSA, probabilistic LSA [72] was proposed to provide a probabilistic model, and latent Dirichlet allocation (LDA) [14] further provides a Bayesian treatment of the generative process. Along another direction, methods like [28, 138, 142] improve on SVD using more sophisticated factorizations.

Bayesian PMF (BPMF) [142] provides a Bayesian treatment for PMF to achieve automatic model complexity control. It demonstrates the effectiveness and efficiency of Bayesian methods and MCMC in real-world large-scale data mining tasks, and inspired our research. However, as mentioned before, BPMF is a static model that cannot handle evolving data. BPTF enhances it by adapting the latent features to include the time information. From the algorithmic perspective, BPTF extends BPMF so that it can deal with multi-dimensional tensor data, with the time dimension receiving special treatment. Although BPTF is more flexible than BPMF, the increase in the number of parameters is negligible, considering that the number of time slices is often much smaller than the number of entities. Another difference is that BPMF leaves the observation precision $\alpha$ as a tuning parameter, while our Bayesian treatment covers all the parameters. There are also other probabilistic tensor factorizations such as Multi-HDP [132], Probabilistic Non-negative Tensor Factorization [144], and Probabilistic Polyadic Factorization [30]. Yet they are designed neither for prediction nor for modeling temporal effects.

Temporal modeling was largely neglected in the collaborative filtering community until Koren [85] proposed the award-winning algorithm timeSVD++. The timeSVD++ method assumes that the latent features consist of some components that evolve linearly over time and some others that are dedicated biases for each user at each specific time period. This model can effectively capture local changes of user preferences (i.e. each user evolves independently), which the author claims to be vital for improving the performance. BPTF, on the other hand, tries to capture the global effects of time that are shared among all users and items. For our sales prediction purpose, we argue that modeling the evolution of the overall market is more effective, since the behavior of retailers is not very localized and the data are very sparse.

Real data sets are rarely stationary. Recently, several algorithms aimed at learning the evolution of data were proposed. Tong et al. [160] proposed an online algorithm to efficiently compute the proximity in a series of evolving bipartite graphs. Ahmed and Xing [2] added dynamic components to LDA to track the evolution of topics in a text corpus. Sarkar et al. [143] considered the dynamic graph embedding problem and used a Kalman filter to track the embedding coordinates through time. All these works reveal the dynamic nature of various problems.


2.6  Experiments 

We conducted several experiments on three real-world data sets to test the effectiveness of BPTF. In these data sets, a timestamp is available for each rating, which can thus be denoted by the tuple $(\mathbf{u}_i, \mathbf{v}_j, \mathbf{t}_k, x_{ij}^k)$. The experimental domains include sales prediction and online movie recommendation.

For comparison, we also implemented PMF and BPMF and report their performance. When training the non-temporal models, the time information is dropped, so the actual tuple used is $(\mathbf{u}_i, \mathbf{v}_j, x_{ij})$. For PMF, stochastic gradient descent with a fixed learning rate (lrate) is adopted for training, and its parameters are obtained by hand tuning to achieve the best accuracy. For BPMF and BPTF, Gibbs sampling is used for training and the results from PMF are used to initialize the sampling. Similar to [142], parameters for Bayesian methods are set according to prior knowledge without tuning. Unless indicated otherwise, the parameters used for the priors are $\mu_0 = 0$, $\nu_0 = D$, $\beta_0 = 1$, $W_0 = I$, $\tilde{\nu}_0 = 1$, $\rho_0 = \mathbf{1}$, where $\mathbf{1} \in \mathbb{R}^{Z}$ is a vector of 1’s.

The algorithms are implemented in MATLAB with embedded C functions.

2.6.1  Sales  Prediction 

In this section we evaluate the performance of BPTF on a sales prediction task for ECCO®, a shoe company selling thousands of kinds of shoes to thousands of retailer customers all over the world. For consistency of expression we still use “user” to represent “customer” and “item” to represent ECCO’s product: shoes.

ECCO sells its shoes in two seasons each year. Here we use “2008.1” to denote the spring season of 2008 and “2006.2” for the fall season of 2006. For each season there is a period for accepting orders. Suppose we are in the middle of the current ordering period. Our problem is: in the remaining part of this season, how many orders of an item can be expected from a particular user? The only data we have are the existing sales records. No attributes for the items or users are available. As mentioned in Section 2.1, this data set is characterized by changing preferences and the fast emergence and disappearance of entities. On average we have thousands of items and users with only 2% of the possible entries observed. Moreover, in each season 75-80% of the items and around 20% of the users are new arrivals compared to the last corresponding season. All these characteristics make it a particularly challenging problem for collaborative filtering.

The data specification is as follows. We have the sales records from 2005 to 2008, so there are 4 spring seasons and 4 fall seasons, which are handled separately. For each season, we select a week as the cut-off point so that orders before this week are used for training and the rest are used for testing. For example, if we want to predict the orders of season 2008.1 after week 40 of 2007 (the cut-off point), then the training data are the orders in seasons {2005.1, 2006.1, 2007.1, 2008.1} that happened before the cut-off, and the testing data are the orders of 2008.1 after the cut-off. We use a single cut-off point for all spring seasons and another one for all fall seasons. The resulting test set contains 15-20% of the orders. Note that this choice is arbitrary in the sense that the progress of the sales varies from season to season. We measure the performance of the algorithms using the mean absolute error (MAE) for each order, since it is the most relevant quantity for ECCO.

Figure 2.3: Performance comparison of PMF, BPMF, and BPTF on 6 seasons of ECCO’s sales data (horizontal axis: season). BPTF outperforms the others by a large margin. See text for details.

We  observed  that  the  within-season  variability  of  data  is  much  larger  than  the  cross-season  one. 
This  means  that  trends  like  “Customers  tend  to  order  formal  shoes  early  and  golf  shoes  late”  are 
strong.  Therefore,  we  assign  the  timestamp  of  each  order  according  to  the  cut-off  week  so  that  the 
latent  factors  can  evolve  within  seasons.  Concretely,  every  season  is  divided  into  early  season  and 
late  season  by  the  cut-off  week,  resulting  in  two  time  slices.  Note  that  the  data  are  not  grouped  by 
seasons,  and  all  the  test  data  are  in  the  late  season  slice.  For  each  test  tuple,  we  use  the  time  factor 
for  the  late  season  to  make  the  prediction. 
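The slicing and splitting rules described above can be sketched as follows in Python. The `Order` record, its field names, and the use of a week index within the ordering period are hypothetical; the sketch also assumes that the orders passed in belong to one kind of season only (all spring or all fall), since the two are handled separately.

```python
from typing import List, NamedTuple, Tuple


class Order(NamedTuple):
    user: int
    item: int
    year: int        # season year, e.g. 2008 for "2008.1"
    week: int        # hypothetical week index within the ordering period
    quantity: float


def time_slice(order: Order, cutoff_week: int) -> int:
    """0 for the early-season slice, 1 for the late-season slice."""
    return 0 if order.week < cutoff_week else 1


def split_for_target(orders: List[Order], target_year: int,
                     cutoff_week: int) -> Tuple[List[Order], List[Order]]:
    """Training: all earlier seasons plus the early slice of the target season.
    Testing: the late slice of the target season. Later seasons are ignored."""
    train, test = [], []
    for o in orders:
        if o.year < target_year:
            train.append(o)
        elif o.year == target_year:
            (train if time_slice(o, cutoff_week) == 0 else test).append(o)
    return train, test
```

Every test order falls in the late-season slice, so predictions for it use the late-season time factor, while both slices of earlier seasons contribute training data.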

We test the three algorithms on all seasons except 2005.1 and 2005.2, since these have no previous seasons. The parameters are $\lambda_U = \lambda_V = 0.1$, lrate $= 1 \times 10^{-5}$ for PMF, $\alpha = 0.04$ for BPMF, and $\tilde{W}_0 = 0.04$ for BPTF. BPMF and BPTF both use the same initialization from PMF. We generate 50 samples, by which point the accuracy has stabilized.

The prediction accuracies are reported in Figure 2.3. We conclude that our prediction has an average error of 20 pairs for each order, and that the accuracy for spring seasons is much lower than for fall seasons. For all the seasons BPTF consistently outperforms the static methods by a fairly large margin. Note that PMF appears better than BPMF here. The reason may be that for this moderately sized problem we are able to tune the PMF parameters to get the best results, while for BPMF and BPTF we assign the parameters by prior knowledge. The results verify that we can enhance the prediction by modeling the temporal effects of data using BPTF.

        PMF      BPMF     BPTF
RMSE    0.9166   0.9083   0.9044

Table 2.1: RMSE of PMF, BPMF and BPTF on Netflix data.

2.6.2  Movie  Rating  Prediction 

To make the comparison more transparent, we also ran experiments on benchmark movie rating problems: Netflix² and MovieLens³. These large-scale data sets consist of users’ ratings of various movies on a 5-star scale, and our task is to predict the rating for new user-movie pairs.

To measure the accuracy, we adopt the root mean squared error (RMSE) criterion, as commonly used in the collaborative filtering literature and the Netflix Prize. For all models, raw user ratings are used as the input. Prediction results are clipped to the range [1, 5].
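For concreteness, the two error measures used in this chapter can be written as the following Python helpers: RMSE with clipping of predictions to the rating range, and MAE as used for the sales data. The function names and default bounds are illustrative choices.

```python
import math
from typing import List


def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def rmse(pred: List[float], truth: List[float],
         lo: float = 1.0, hi: float = 5.0) -> float:
    """Root mean squared error after clipping each prediction to [lo, hi]."""
    errs = [(clip(p, lo, hi) - t) ** 2 for p, t in zip(pred, truth)]
    return math.sqrt(sum(errs) / len(errs))


def mae(pred: List[float], truth: List[float]) -> float:
    """Mean absolute error, without clipping."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)
```

Clipping can only reduce the error when the true ratings lie in [1, 5], since any prediction outside that range is moved towards every admissible true value.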

Netflix 

The Netflix data set contains 100,480,507 ratings from $M = 480{,}189$ users on $D = 17{,}770$ movies between 1999 and 2005. Among these ratings, 1,408,395 are selected uniformly over the users as the probe set for validation. Time information is provided in days. The ratio of observed ratings to all entries of the rating matrix is 1.16%. As a baseline, the score of Netflix’s Cinematch system is RMSE = 0.9514.

The time slices we use for BPTF basically correspond to calendar months. However, since ratings in the early months are much scarcer than those in the later months, we aggregate several of the earlier months so that every time slice contains an approximately equal number of ratings. In practice we found that, over a fairly large range, the slicing of time does not affect the performance much. In the end, we have $K = 27$ time slices for the entire data set.
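One possible way to merge sparse early months is sketched below in Python. The greedy rule and the threshold `min_count` are assumptions made for illustration; the text above only states that early months were aggregated so that slices hold roughly equal numbers of ratings.

```python
from typing import List


def merge_sparse_months(counts: List[int], min_count: int) -> List[int]:
    """Map each calendar month (in chronological order) to a time-slice index.

    Consecutive months are merged greedily until the current slice holds at
    least min_count ratings; busy months therefore form slices on their own.
    """
    slice_of_month = []
    current, accumulated = 0, 0
    for c in counts:
        slice_of_month.append(current)
        accumulated += c
        if accumulated >= min_count:
            current += 1
            accumulated = 0
    return slice_of_month


# Made-up monthly counts: three sparse months followed by three busy ones.
slices = merge_sparse_months([1, 2, 3, 10, 12, 9], min_count=6)
```

In this toy example the first three months are merged into one slice and each later month keeps its own slice, giving four slices in total.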

Following the settings in the BPMF paper [142], we use $Z = 30$ latent features to model each entity and set $\lambda_U = \lambda_V = 0.015$, lrate $= 0.001$ for PMF, $\alpha = 2$ for BPMF, and $\tilde{W}_0 = 2$ for BPTF. The parameters for the Bayesian methods are set as constants based on prior knowledge and are not tuned for best accuracy. 100 samples are used to generate the final prediction.

The prediction accuracies of PMF, BPMF, and BPTF on the probe set are presented in Table 2.1. Figure 2.4 shows how the accuracy changes as the number of samples increases. BPMF shows a large improvement over its non-Bayesian ancestor PMF, and BPTF provides a further steady gain in accuracy. However, BPTF does not beat the RMSE = 0.8891 result of 20-dimensional timeSVD++ (quoted from their paper), which is the state-of-the-art temporal model for the Netflix data. As pointed out by the authors of timeSVD++, the most important trait of the Netflix data is that there are many local changes of preference, which may affect just one user on one day. BPTF, on the other hand, aims at learning the global evolution and thus cannot capture these changes. Nevertheless, modeling the global changes still gives us improved performance.

To generate one sample, BPTF with $Z = 30$ latent features took about 9 minutes using about 5 GB of RAM. For comparison, BPMF takes 6 minutes per sample. We ran our experiments in a single-threaded MATLAB process on a 2.4 GHz AMD Opteron CPU with 64 GB RAM. We did not use the parallel implementation because it involves distributing a large amount of data, which the computational model provided by MATLAB does not handle well. However, since each user and movie latent feature vector can be sampled independently, we believe that on more sophisticated platforms BPTF can work well with MapReduce-style parallel processing.

Figure 2.4: Convergence curves of BPTF and BPMF on the full Netflix data. As the number of samples increases, the RMSE of the Bayesian methods drops monotonically. The RMSEs of Netflix’s baseline and PMF are also presented.

² http://archive.ics.uci.edu/ml/datasets/Netflix+Prize

³ http://www.grouplens.org/node/73

We also ran a group of experiments on a subset of the Netflix data constructed by randomly selecting 20% of the users and 20% of the movies. It consists of $M = 95{,}992$ users, $D = 3{,}565$ movies, and 4,167,600 ratings. This subset is further divided into training and testing sets by randomly selecting 10 ratings (or 1/3 of the total ratings, whichever is smaller) from each user as the testing set. This sampling strategy is similar to the one used by the Netflix Prize. The new data set contains about 4% of the original data and is thus suitable for detailed experimental analysis. In the training process, the parameters are $\lambda_U = \lambda_V = 0.03$, lrate $= 0.001$ for PMF, and for the Bayesian methods we adopt the same parameters as for the full data.
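The per-user hold-out rule described above can be sketched as follows in Python. The function name, the dictionary input format, and the choice to round one third down are assumptions; the text does not specify how fractional counts are handled.

```python
import random
from typing import Dict, List, Tuple


def split_per_user(ratings_by_user: Dict[str, List],
                   rng: random.Random,
                   max_test: int = 10) -> Tuple[Dict[str, List], Dict[str, List]]:
    """Hold out min(max_test, one third of a user's ratings) at random per user."""
    train, test = {}, {}
    for user, ratings in ratings_by_user.items():
        n_test = min(max_test, len(ratings) // 3)  # rounding down is an assumption
        held_out = set(rng.sample(range(len(ratings)), n_test))
        test[user] = [r for idx, r in enumerate(ratings) if idx in held_out]
        train[user] = [r for idx, r in enumerate(ratings) if idx not in held_out]
    return train, test
```

The one-third cap ensures that users with few ratings still keep most of their data for training.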

First, we investigate the performance of the algorithms as the number of factors varies. For dimensions 10, 20, 50, and 100, the convergence curves are shown in Figure 2.5. The RMSE steadily decreases as the number of factors increases, and no over-fitting is observed. When using 100 factors, there are on average two parameters for a single rating. This clearly shows the effect of model averaging in the Bayesian approach. Also, by comparing the curves of BPTF and BPMF, we see that BPTF with 20 factors performs similarly to BPMF with 100 factors. This demonstrates the advantage of temporal modeling, considering that the number of parameters in BPMF is then about 5 times that of BPTF.

Figure 2.5: Convergence curves of BPMF and BPTF with different numbers of factors on a subset of Netflix data. The accuracy increases when more factors are used, and no over-fitting is observed. Also, BPTF with 20 factors achieves similar performance to BPMF with 100 factors.
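The parameter counts behind these statements can be checked with a short calculation, sketched here in Python using the subset sizes given above. The number of time slices for the subset is not stated in the text, so `K = 27` (the value used for the full data) is an assumption, as is counting only the latent factors as parameters.

```python
def n_factor_params(n_users: int, n_items: int, n_slices: int, Z: int) -> int:
    """Number of latent-factor entries for Z-dimensional factors."""
    return (n_users + n_items + n_slices) * Z


M, D, n_ratings = 95_992, 3_565, 4_167_600
K = 27  # assumption: same slicing as the full Netflix data

bpmf_100 = n_factor_params(M, D, 0, 100)   # BPMF has no time factors
bptf_20 = n_factor_params(M, D, K, 20)

params_per_rating = bpmf_100 / n_ratings
ratio = bpmf_100 / bptf_20
```

Under these assumptions a 100-factor model has about 2.4 parameters per rating in the subset, and BPMF with 100 factors has almost exactly 5 times as many factor parameters as BPTF with 20 factors.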


We further examine the significance of the improvement of BPTF over BPMF by repeating the prediction task 20 times using different random test sets. The resulting box plot of RMSEs is shown in Figure 2.6a. The p-value of the paired t-test between the results of BPMF and BPTF is $1.3 \times 10^{-12}$. In fact, BPTF produces better results than BPMF in every run.
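For readers unfamiliar with the paired t-test, the statistic is computed from the per-run differences between the two methods, as in the Python sketch below. Only the t statistic and the degrees of freedom are computed; turning them into a p-value requires the Student t distribution, which is left out. The function name and the example numbers are made up and are not results from our experiments.

```python
import math
from typing import List, Tuple


def paired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0.0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(var / n), n - 1


# Toy example with made-up numbers (not thesis results).
t_stat, dof = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 2.5, 3.0])
```

Pairing matters here because both methods are evaluated on the same random test sets, so the run-to-run variation that is common to both cancels in the differences.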


MovieLens 


The MovieLens data set contains 1,000,209 movie ratings from $M = 6{,}040$ users on $D = 3{,}706$ movies between April 2000 and February 2003, with the restriction that each user has at least 20 ratings. The ratio of observed ratings is around 4.5%. Time information is provided in seconds. We randomly select 10 ratings from each user as the test set, which is roughly 6.5% as large as the training set. The time slices used for BPTF correspond to calendar months. We again use $Z = 30$ latent features. The parameters for PMF are $\lambda_U = \lambda_V = 0.05$, lrate $= 0.001$ as in [31], and the parameters for the Bayesian methods are the same as for Netflix.

Figure 2.6b shows the performance of the three algorithms over 20 random runs. This result is
similar to what we have for the Netflix data. BPTF still consistently outperforms BPMF, and the
p-value of the paired t-test between them is $8.9 \times 10^{-15}$.

Figure  2.6:  RMSE  of  PMF,  BPMF,  and  BPTF.  (a)  On  a  subset  of  Netflix  data,  (b)  On  MovieLens 
data.  Lower  RMSE  is  better. 

2.6.3  Music  Recommendation 

We also test the algorithms' performance on the recently released Yahoo Music⁴ data set. This
data set is very similar to the previous movie rating data sets, except that instead of movies we are
dealing with music records. It contains 1,000,990 users, 624,961 music records, and 252,800,275
ratings on 3,974 different days. The ratings are given on a scale between 0 and 100.

A subset of this data set is used. As in the Netflix experiment, we first randomly sample 20%
of the users and music records from the data set, and then remove users that have fewer than 50
ratings. The resulting data set contains 38,685 users, 24,502 music records, and 6,798,119 ratings.
In this subset, only 0.7% of the ratings are observed. The time indices of the ratings are discretized
into 30 equally sized bins. The ratings are scaled to the range [1, 5], the same as in the movie
recommendation problems.
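
The preprocessing described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments. It assumes that "equally sized" means equal-width bins over the observed time range (equal-count bins would be another reading), and the function names `discretize_time` and `rescale_ratings` are hypothetical.

```python
import numpy as np


def discretize_time(timestamps, n_bins=30):
    """Map raw timestamps to bin indices 0..n_bins-1 (assumed: equal-width bins)."""
    t = np.asarray(timestamps, dtype=float)
    t_min, t_max = t.min(), t.max()
    if t_max == t_min:
        return np.zeros(t.shape, dtype=int)
    idx = np.floor((t - t_min) / (t_max - t_min) * n_bins).astype(int)
    # The latest timestamp would land in bin n_bins, so clip it into the last bin.
    return np.minimum(idx, n_bins - 1)


def rescale_ratings(ratings, old_range=(0.0, 100.0), new_range=(1.0, 5.0)):
    """Linearly map ratings from old_range to new_range."""
    r = np.asarray(ratings, dtype=float)
    (a, b), (c, d) = old_range, new_range
    return c + (r - a) * (d - c) / (b - a)
```

With these defaults a rating of 50 on the original 0 to 100 scale maps to 3 on the [1, 5] scale.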

To evaluate the performance, in each round we randomly select 10 ratings from each user to
construct the test set. The performance over 20 random runs is reported in Figure 2.7. BPTF still
consistently performs the best, showing the advantage of temporal models. We can also see that
the non-Bayesian PMF model exhibits clear over-fitting when the number of factors is
large.

2.7  Summary 

We present the Bayesian Probabilistic Tensor Factorization (BPTF) algorithm for modeling
temporally evolving data. By introducing a set of additional time features to traditional factorization
algorithms, and imposing a smoothness constraint on those factors, BPTF is able to learn the global
evolution of latent factors. An efficient MCMC procedure is proposed that realizes automatic
model averaging and largely eliminates the need for tuning parameters on large-scale data. We show

4 http://kddcup.yahoo.com/datasets.php

Figure 2.7: Prediction RMSE on the Yahoo Music data set using different methods and different
numbers of latent factors.


extensive empirical results on several real-world data sets to illustrate the advantage of temporal
models over static models.

In future work, we may adopt other types of observational models, such as exponential
family distributions. The Gaussian model has been extensively used for rating data and has proved
to be very effective. However, it might be better to use transformations [141] or other distributions
to handle ratings that are discrete and have limited support. Similarly, for the sales prediction
problem, a Poisson model might be better suited. However, these changes may lead to more
complicated posterior distributions. We can then consider more general Metropolis-Hastings
sampling techniques such as [134].


Detailed  Derivation 

In this section we give explicit forms for the conditional distributions used in Algorithm 2.
According to our model assumption in Figure 2.2, the joint posterior distribution can be factorized
as

$$p(U, V, T, \alpha, \Theta_U, \Theta_V, \Theta_T \mid X) \propto p(X \mid U, V, T, \alpha)\, p(U \mid \Theta_U)\, p(V \mid \Theta_V)\, p(T \mid \Theta_T)\, p(\Theta_U)\, p(\Theta_V)\, p(\Theta_T)\, p(\alpha). \tag{2.15}$$

By  plugging  into  Eq.  (2.15)  all  the  model  components  described  in  Section  2.4.1  and  carrying 
out  proper  marginalization,  we  derive  the  desired  conditional  distributions  in  the  following  two 
subsections. 




2.7.1  Hyper-parameters 

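By using the conjugate prior for the rating precision $\alpha$, we have that the conditional distribution of
$\alpha$ given X, U, V and T follows the Wishart distribution:

$$p(\alpha \mid X, U, V, T) = \mathcal{W}(\alpha \mid \tilde{W}_0^*, \tilde{\nu}_0^*), \tag{2.16}$$

$$\tilde{\nu}_0^* = \tilde{\nu}_0 + \sum_{k=1}^{K} \sum_{i=1}^{M} \sum_{j=1}^{D} I_{ij}^k,$$

$$(\tilde{W}_0^*)^{-1} = \tilde{W}_0^{-1} + \sum_{k=1}^{K} \sum_{i=1}^{M} \sum_{j=1}^{D} I_{ij}^k \left( X_{ij}^k - \langle u_i, v_j, t_k \rangle \right)^2.$$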

For $\Theta_U = \{\mu_U, \Lambda_U\}$, our graphical model assumption in Figure 2.2 suggests that it is
conditionally independent of all the other parameters given U. We thus integrate out all the random
variables in Eq. (2.15) except U and obtain the Gaussian-Wishart distribution:

$$p(\Theta_U \mid U) = \mathcal{N}(\mu_U \mid \mu_0^*, (\beta_0^* \Lambda_U)^{-1})\, \mathcal{W}(\Lambda_U \mid W_0^*, \nu_0^*).$$

Similarly, $\Theta_V = \{\mu_V, \Lambda_V\}$ is conditionally independent of all the other parameters given V, and
its conditional distribution has the same form:

$$p(\Theta_V \mid V) = \mathcal{N}(\mu_V \mid \mu_0^*, (\beta_0^* \Lambda_V)^{-1})\, \mathcal{W}(\Lambda_V \mid W_0^*, \nu_0^*), \tag{2.18}$$

$$\mu_0^* = \frac{\beta_0 \mu_0 + D \bar{V}}{\beta_0 + D}.$$

Finally, $\Theta_T = \{\mu_T, \Lambda_T\}$ is conditionally independent of all other parameters given T, and its
conditional distribution is also Gaussian-Wishart:

$$p(\Theta_T \mid T) = \mathcal{N}(\mu_T \mid \mu_0^*, (\beta_0^* \Lambda_T)^{-1})\, \mathcal{W}(\Lambda_T \mid W_0^*, \nu_0^*), \tag{2.19}$$

$$\mu_0^* = \frac{t_1 + \beta_0 \rho_0}{1 + \beta_0}.$$

2.7.2 Model parameters


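We first consider the user features U. According to the graphical model in Figure 2.2, its
conditional distribution factorizes with respect to individual users:

$$p(U \mid X, V, T, \alpha, \Theta_U) = \prod_{i=1}^{M} p(u_i \mid X, V, T, \alpha, \Theta_U).$$

We then have, for each user feature vector,

$$p(u_i \mid X, V, T, \alpha, \Theta_U) = \mathcal{N}(u_i \mid \mu_i^*, (\Lambda_i^*)^{-1}), \tag{2.20}$$

$$\mu_i^* = (\Lambda_i^*)^{-1} \Big( \Lambda_U \mu_U + \alpha \sum_{k=1}^{K} \sum_{j=1}^{D} I_{ij}^k X_{ij}^k Q_{jk} \Big),$$

$$\Lambda_i^* = \Lambda_U + \alpha \sum_{k=1}^{K} \sum_{j=1}^{D} I_{ij}^k Q_{jk} Q_{jk}^T,$$

where $Q_{jk} = v_j \odot t_k$ is the element-wise product of $v_j$ and $t_k$. For the item features V, the
conditional distribution factorizes with respect to individual items, and for each item feature vector
we have

$$p(v_j \mid X, U, T, \alpha, \Theta_V) = \mathcal{N}(v_j \mid \mu_j^*, (\Lambda_j^*)^{-1}), \tag{2.21}$$

$$\mu_j^* = (\Lambda_j^*)^{-1} \Big( \Lambda_V \mu_V + \alpha \sum_{k=1}^{K} \sum_{i=1}^{M} I_{ij}^k X_{ij}^k P_{ik} \Big),$$

$$\Lambda_j^* = \Lambda_V + \alpha \sum_{k=1}^{K} \sum_{i=1}^{M} I_{ij}^k P_{ik} P_{ik}^T,$$

where $P_{ik} = u_i \odot t_k$.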
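
As a rough illustration of how one such Gibbs draw could be implemented, the sketch below samples a single user vector. It assumes a Gaussian observation model with precision $\alpha$ around $\langle u_i, v_j, t_k \rangle$ and a Gaussian prior with mean $\mu_U$ and precision $\Lambda_U$ on $u_i$. Under these assumptions, standard Gaussian conjugacy gives a conditional precision of $\Lambda_U + \alpha \sum q q^T$ and a conditional mean of $(\Lambda_i^*)^{-1}(\Lambda_U \mu_U + \alpha \sum x\, q)$, where the sums run over the observed ratings of the user and $q = v_j \odot t_k$. The function name and argument layout are hypothetical.

```python
import numpy as np


def sample_user_vector(x_obs, Q_obs, mu_U, Lambda_U, alpha, rng):
    """Draw one user vector from its Gaussian conditional (illustrative sketch).

    x_obs    : (n,) observed ratings of this user
    Q_obs    : (n, L) rows are the element-wise products v_j * t_k of the
               item and time features for each observed rating
    mu_U     : (L,) prior mean
    Lambda_U : (L, L) prior precision
    alpha    : scalar rating precision
    """
    precision = Lambda_U + alpha * Q_obs.T @ Q_obs
    mean = np.linalg.solve(precision, Lambda_U @ mu_U + alpha * Q_obs.T @ x_obs)
    # Sample via the Cholesky factor of the precision: if precision = C C^T,
    # then mean + C^{-T} z has covariance precision^{-1} for z ~ N(0, I).
    C = np.linalg.cholesky(precision)
    z = rng.standard_normal(mean.shape[0])
    sample = mean + np.linalg.solve(C.T, z)
    return sample, mean, precision
```

The item update is symmetric, with the products $u_i \odot t_k$ taking the place of $v_j \odot t_k$. Working with the precision matrix avoids forming the covariance explicitly.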

Regarding the time features, the conditional distribution of $t_k$ is also a Gaussian distribution:

$$p(t_k \mid X, U, V, T_{-k}, \alpha, \Theta_T) = \mathcal{N}(t_k \mid \mu_k^*, (\Lambda_k^*)^{-1}), \tag{2.22}$$

where $T_{-k}$ denotes all the time feature vectors except $t_k$. The mean vectors and the precision
matrices depend on k in the following way:

For $k = 1$,

For $k = K$,


Chapter 3

Handling  Outliers  by  Robust  Factorization 


This chapter focuses on the outliers/anomalies in the data and proposes an algorithm to improve
the robustness of factorization methods.

Matrix factorization methods are extremely useful in many data mining tasks, yet their
performance is often degraded by outliers. In order to alleviate the influence of outliers, we directly
formulate the robust factorization problem as a matrix approximation problem with constraints on
the rank of the matrix and the cardinality of the outlier set. Then, unlike existing methods that
resort to relaxations, we solve this problem directly and efficiently. In addition, structural knowledge
about the outliers can be incorporated to find outliers more effectively. We applied this method
to anomaly detection tasks on various data sets. Empirical results show that this new algorithm
is effective in robust modeling and anomaly detection, and our direct solution achieves superior
performance over the state-of-the-art methods based on the $L_1$-norm and the nuclear norm of
matrices.


3.1  Introduction 

Real-world problems almost always involve data that do not conform to the assumptions made
in our models. These data are called outliers or anomalies. Outliers can severely degrade
a model's quality and performance, so we want robust methods that reduce their impact.
In novelty detection problems, we are also interested in finding and studying these outliers
since they might lead to discoveries. To do this, we also need reliable models that are not distorted
by outliers.

The definition of an outlier varies depending on the application and the behavior of the data we
want to capture. A popular assumption is that the normal data are close together, and consequently
outliers are far away from the others, i.e. they lie in the low-density region of the data distribution
[21, 182]. For a survey of the outlier detection field, readers can refer to [26]. In this work, we
consider another common definition called the subspace outlier, which comes from the assumption
that the normal data reside in a low-dimensional linear subspace, which the outliers lie outside of.
This means, for example in signal processing, that a normal signal can be reconstructed from a few
bases. If a signal cannot be well reconstructed from these bases, it is an outlier. This subspace-based
modeling is
widely  used  in  various  problems  such  as  dimensionality  reduction,  signal/image  processing,  time 
series  analysis,  and  collaborative  filtering. 

Matrix factorization techniques, such as principal component analysis (PCA) and non-negative
matrix factorization (NMF) [93], are extremely useful in learning subspace structures from data.
However, traditional methods are prone to distortion by outliers [86]. Since factorizations are
usually done by minimizing the error made by the model, a popular way of achieving robustness
is to use error measurements that are insensitive to outliers. Though pervasively used, the
mean squared error or the $L_2$ error measure is known to be vulnerable to outliers [176]. In machine
learning and statistics, the $L_1$ error measure (mean absolute error) is widely used for the purpose
of robustness [15, 22]. Other measures like the Huber loss [74] and the Geman-McClure function
have also been employed [88, 123]. These robust measurements usually increase the algorithms'
complexity significantly. Another strategy is to exclude the outliers: we can first guess which
data are outliers, and then reduce their influence on the model [86, 176].

The  contribution  of  this  work  is  to  propose  a  novel  algorithm  for  learning  robust  subspace 
models  based  on  matrix  factorization.  For  a  data  matrix  X,  we  assume  that  it  is  approximately 
low-rank,  and  a  small  portion  of  this  matrix  has  been  corrupted  by  some  arbitrary  outliers.  The 
goal  of  the  proposed  algorithm  is  to  get  a  reliable  estimation  of  the  true  low-rank  structure  of 
this  matrix.  To  achieve  this,  our  basic  idea  is  to  exclude  the  outliers  from  the  model  estimation. 
Specifically,  the  proposed  algorithm  directly  answers  the  question:  if  you  are  allowed  to  ignore 
some  data  (outliers),  what  is  the  best  low-rank  model  you  can  get? 

We formulate this problem as a constrained optimization problem. This formulation aims at
minimizing the $L_2$ error of the low-rank approximation subject to the constraint that the number of
ignored outliers is small, without any further assumptions. This formulation reflects our direct
understanding of outliers and robust estimation. Thus we call it direct robust matrix factorization (DRMF).

It can be shown that DRMF is the original problem that the recently popular nuclear norm
based methods (e.g. [22, 177]) are trying to solve. However, unlike these methods that resort
to relaxation techniques, we directly form these constraints in terms of the matrix rank and the
cardinality of the outlier set. Although matrix rank and set cardinality are often very difficult to
handle in optimization, we are able to solve this problem directly in its original form. We observe
that this direct solution produces better-quality results than the relaxed methods.

We adopt block coordinate descent to solve the DRMF problem. The resulting algorithm is
based on existing factorization routines, such as the singular value decomposition (SVD), and
efficient thresholding procedures. Therefore DRMF is simple to implement, efficient, and easy to
use. DRMF is also very flexible: we can impose additional constraints on both the factorization
(e.g. nonnegative factors [93]) and the outliers (e.g. outlier columns instead of entries [177]) to
incorporate knowledge for better performance.

We applied DRMF to both synthetic and real-world data sets for the purpose of robust modeling
and anomaly detection. We compare DRMF to its state-of-the-art competitors based on the nuclear
norm and the $L_1$ error measurement. Based on extensive empirical results we conclude that DRMF
is able to achieve better performance than these relaxed methods. In addition, the parameters of DRMF
are intuitive and easy to tune, making it a practical tool for robust analysis.

The rest of this chapter is structured as follows. We introduce background and notation in
Section 3.1.1. Section 3.2 describes the proposed algorithm. Related work and discussions are in
Sections 3.3 and 3.4. Experiments are presented in Section 3.5. Finally, we draw our conclusions.

3.1.1  Background  and  Notation 

Matrices are very useful in representing data. For example, in regression and classification,
samples are often organized into a design matrix in which each row represents a sample and each
column represents a feature. The document-word matrix is often used for text data. In
recommendation systems we have the rating matrix. Connectivity matrices are widely used to express
network and graph data. We denote a data matrix as $X \in \mathbb{R}^{M \times D}$. $X_{i,j}$ denotes the $(i,j)$th entry of
X. We also use the operator $\mathcal{D}(\cdot)$ to return an $l \times l$ diagonal matrix whose diagonal is the input
vector.

One of the most common analyses we can do on X is factorization, as in principal component
analysis (PCA). We assume that X has a low rank and can be factorized as

$$X \approx U V^T, \quad U \in \mathbb{R}^{M \times K},\; V \in \mathbb{R}^{D \times K}, \tag{3.1}$$

where K is the rank of the factorization. For design matrices, the factors given by PCA/SVD reveal
the linear structure and intrinsic dimensionality of the data. For text data, latent semantic indexing
(LSI) [72] and nonnegative matrix factorization (NMF) [43] are often applied. The low-rank
assumption is also useful in matrix completion [23, 111] and collaborative filtering [138, 141] (see
Chapter 2).

In a more general form, low-rank matrix factorization can be written as the following
optimization problem

$$\min_{L} \; \|X - L\|_F \quad \text{s.t.} \quad \operatorname{rank}(L) \le K, \tag{3.2}$$

where $\|\cdot\|_F$ is the Frobenius norm, L is the low-rank approximation of X, and K is the maximal
rank of L.

Singular value decomposition (SVD) is perhaps the most commonly used tool for low-rank
analysis. SVD decomposes a matrix into three factors:

$$X = U \mathcal{D}(s) V^T = \sum_{i=1}^{l} s_i u_i v_i^T, \tag{3.3}$$

where $l = \min(M, D)$, $s = [s_1, \ldots, s_l]$ is the vector of X's singular values in descending order,
and the columns of $U = [u_1, \ldots, u_l] \in \mathbb{R}^{M \times l}$ and $V = [v_1, \ldots, v_l] \in \mathbb{R}^{D \times l}$ are the left and right
singular vectors. The significance of SVD is reflected in the following theorem [48]:

Theorem 1 (Eckart-Young). Let the SVD of X be (3.3). For any K with $0 \le K \le \operatorname{rank}(X)$, let

$$L_K = U_{:,1:K}\, \mathcal{D}(s_{1:K})\, V_{:,1:K}^T = \sum_{i=1}^{K} s_i u_i v_i^T, \tag{3.4}$$

then

$$\|X - L_K\|_F = \min_{\operatorname{rank}(L) \le K} \|X - L\|_F.$$

In other words, the rank-K truncated SVD approximation $L_K$ is a globally optimal solution to
problem (3.2).

From SVD we can derive the nuclear norm of matrices. The nuclear norm of a matrix X is
defined as $\|X\|_* = \sum_{i=1}^{l} s_i$, i.e. the sum of X's singular values. The nuclear norm can serve as a
convex relaxation of the matrix rank, and has attracted much research interest recently. We shall
discuss this further in Section 3.3.
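
As a quick numerical illustration of Theorem 1 and of the nuclear norm, the following sketch uses NumPy's dense SVD. The helper names `truncated_svd` and `nuclear_norm` and the random test matrix are assumptions made for this example only.

```python
import numpy as np


def truncated_svd(X, K):
    """Rank-K approximation L_K of X built from its K leading singular triplets."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s is in descending order
    return (U[:, :K] * s[:K]) @ Vt[:K, :]


def nuclear_norm(X):
    """Sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()


rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5))
L2 = truncated_svd(X, 2)
s = np.linalg.svd(X, compute_uv=False)

# The Frobenius error of L_K equals the energy of the discarded singular values.
err = np.linalg.norm(X - L2)          # Frobenius norm is the default for matrices
tail = np.sqrt(np.sum(s[2:] ** 2))
```

Here `err` and `tail` agree up to floating-point error, which is one way to check an implementation of the truncated SVD.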

Next we consider robust error measurement. Let the error matrix be $E = X - L$. In (3.2), we
used the Frobenius norm, $\|E\|_F = \sqrt{\sum_{i,j} E_{i,j}^2}$, a.k.a. the $L_2$-norm, to measure E. The $L_2$-norm
is pervasively used but is known to be sensitive to outliers [86]. A common robust alternative is
the $L_1$-norm $\|E\|_1 = \sum_{i,j} |E_{i,j}|$ [15, 22], in which the errors are not squared so the impact of
large errors is reduced. A more aggressive choice is the $L_0$-norm¹ $\|E\|_0 = \sum_{i,j} I(E_{i,j} \neq 0)$. The
$L_0$-norm only counts the number of errors, disregarding their magnitudes.

Recently, structured norms have become popular in handling problems such as group lasso [180] and
multitask learning [104] with structural knowledge. These norms can also be used to incorporate
knowledge about the structure of outliers (e.g. when outlier entries in the same row are correlated)
[177]. Here we introduce the $L_{2,1}$-norm and the $L_{2,0}$-norm. The $L_{2,1}$-norm $\|E\|_{2,1} = \sum_{i=1}^{M} \|E_{i,:}\|_2$
is the sum of the $L_2$-norms of the rows of E (i.e. the sum of the lengths of the row vectors), and the
$L_{2,0}$-norm $\|E\|_{2,0} = \sum_{i=1}^{M} I(\|E_{i,:}\|_2 \neq 0)$ is the number of non-zero rows in E. These two norms
relate to each other as the $L_1$ and $L_0$ norms do, except that errors are measured in groups according to
the assumed structure.
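
To make these definitions concrete, the sketch below evaluates each error measure on a small matrix. The function names and the example matrix are hypothetical and chosen only for illustration; rows play the role of groups, as in the definitions above.

```python
import numpy as np


def frobenius_norm(E):
    return np.sqrt(np.sum(np.square(E)))


def l1_norm(E):
    return np.sum(np.abs(E))


def l0_norm(E):
    # Number of non-zero entries (not a true norm).
    return int(np.count_nonzero(E))


def l21_norm(E):
    # Sum of the Euclidean lengths of the rows.
    return np.sum(np.linalg.norm(E, axis=1))


def l20_norm(E):
    # Number of rows that are not entirely zero.
    return int(np.count_nonzero(np.linalg.norm(E, axis=1)))


E = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, -1.0]])
# Frobenius: sqrt(9 + 16 + 1) = sqrt(26); L1: 3 + 4 + 1 = 8; L0: 3 non-zero entries;
# L2,1: 5 + 0 + 1 = 6; L2,0: 2 non-zero rows.
```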


3.2  Direct  Robust  Factorization 

We adopt the common assumption that there is only a small number of outliers in the data matrix
X. Then, we define the robust low-rank approximation of X as the answer to the question: if you
are allowed to ignore some data as outliers, what is the best low-rank approximation?

A direct formulation of the above question is the following problem, which we call direct robust
matrix factorization (DRMF):

$$\min_{L,S} \; \|(X - S) - L\|_F \quad \text{s.t.} \quad \operatorname{rank}(L) \le K, \quad \|S\|_0 \le e, \tag{3.5}$$

where L is the low-rank approximation as before, K is the rank, S is the matrix of outliers, and
e is the maximal number of non-zero entries in S, i.e. the maximal number of entries that can
be ignored as outliers. Comparing DRMF to the regular problem (3.2), we can see that the only
difference is that we allow the outliers S to be excluded from the low-rank approximation, as long
as the number of outliers is not too large, i.e. S is sufficiently sparse. Note that we do not need the
actual number of outliers. Instead, we only use e to put an upper limit on it.

¹ Rigorously, this $L_0$ measurement is not a norm.

By  excluding  the  outliers  from  the  low-rank  approximation,  we  can  ensure  the  reliability  of  the 
estimated  low-rank  structure.  On  the  other  hand,  the  number  of  outliers  is  constrained  so  that  the 
estimation  is  still  faithful  to  the  data.  DRMF  is  advantageous  over  existing  methods  in  its  simplicity 
and  directness:  no  special  robust  error  measurement  is  introduced,  nor  do  we  make  assumptions 
about  the  outliers  beyond  necessity.  In  fact,  several  state-of-the-art  methods  are  relaxed  versions 
of  DRMF,  as  we  shall  discuss  in  section  3.3. 

3.2.1  DRMF  Algorithm 

Usually, optimization problems involving the rank or the $L_0$-norm (i.e. set cardinality) are difficult to
solve. Nevertheless, the DRMF problem admits a simple solution due to its decomposable structure
w.r.t. the variables L and S. To take advantage of this property, we adopt the block coordinate descent
strategy, and the resulting algorithm is described in Algorithm 3: we first fix S, the current estimate
of the outliers, exclude them from X to get the “clean” data C, and fit L based on C. Then, we update
the outliers S based on the error $E = X - L$.


Algorithm 3 Direct Robust Matrix Factorization (DRMF)

1. Input:

   • X: the data matrix.
   • K: the maximal rank of the factorization.
   • e: the maximal number of outliers.
   • S: the initial outliers.

2. While not converged:

   (a) Solve the factorization problem:

       $$L = \arg\min_{L} \|C - L\|_F, \quad C = X - S, \quad \text{s.t.} \quad \operatorname{rank}(L) \le K. \tag{3.6}$$

   (b) Solve the outlier detection problem:

       $$S = \arg\min_{S} \|E - S\|_F, \quad E = X - L, \quad \text{s.t.} \quad \|S\|_0 \le e. \tag{3.7}$$

3. Output:

   • L: the robust low-rank approximation.
   • S: the outliers.

It is easy to see that the solution to the low-rank approximation problem (3.6) is directly given
by SVD according to Theorem 1. Therefore, the solution for L is simply the truncated SVD
approximation to C given in (3.4), which can be obtained efficiently. Since only the first K singular
vectors are required, we can further accelerate the computation using partial SVD algorithms such
as PROPACK [91].

The  outlier  detection  problem  (3.7)  can  also  be  solved  efficiently.  To  solve  the  general  problems 
of  L0-norm  constrained  minimization  of  decomposable  objectives,  we  give  the  following  theorem 
which  extends  the  work  of  [107]: 

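Theorem 2. Let $\mathcal{A}$ be a domain with $0 \in \mathcal{A}$; $A = \{a_1, \ldots, a_n\} \in \mathcal{A}^n$; $\{f_i \mid f_i : \mathcal{A} \to \mathbb{R},\ i = 1, \ldots, n\}$
be a set of n functions mapping from $\mathcal{A}$'s elements to real numbers. Also, let $a_i^* =
\arg\min_{a_i} f_i(a_i)$; $b_i = f_i(0) - f_i(a_i^*) \ge 0$; $\|A\|_0$ be the number of non-zero elements in A; e be a
positive integer. Then for the following problem

$$\min_{A} \; f(A) = \sum_{i=1}^{n} f_i(a_i) \quad \text{s.t.} \quad \|A\|_0 \le e,$$

an optimal solution is given by $A^* = \{\hat{a}_1, \ldots, \hat{a}_n\}$ with

$$\hat{a}_i = \begin{cases} a_i^* & b_i \ge b_{(e)} \\ 0 & \text{otherwise,} \end{cases} \tag{3.8}$$

where $b_{(e)}$ is the e-th largest element in $\{b_i\}_{i=1,\ldots,n}$.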

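Proof. This theorem is a slight generalization of [107] and can be derived similarly. Denote the index
set $J = \{j \mid b_j \ge b_{(e)}\}$. Clearly, $A^*$ is a feasible solution to (3.8) since $\|A^*\|_0 \le |J| = e$. For any
other feasible $A'$ with $\|A'\|_0 \le e$ and $A' \neq A^*$, we denote the index set $J' = \{j \mid a'_j \neq 0\}$, so that $|J'| \le e$, and
let $\bar{J}$ be the complement of $J$ (and likewise $\bar{J'}$ the complement of $J'$). Then

$$
\begin{aligned}
f(A') - f(A^*)
&= \sum_{j \in J' \cap J} \big(f_j(a'_j) - f_j(a_j^*)\big) + \sum_{j \in J' \cap \bar{J}} \big(f_j(a'_j) - f_j(0)\big) \\
&\quad + \sum_{j \in \bar{J'} \cap J} \big(f_j(0) - f_j(a_j^*)\big) + \sum_{j \in \bar{J'} \cap \bar{J}} \big(f_j(0) - f_j(0)\big) \\
&\ge \sum_{j \in J' \cap J} \big(f_j(a_j^*) - f_j(a_j^*)\big) + \sum_{j \in J' \cap \bar{J}} \big(f_j(a_j^*) - f_j(0)\big) + \sum_{j \in \bar{J'} \cap J} \big(f_j(0) - f_j(a_j^*)\big) \\
&\ge \sum_{j \in \bar{J'} \cap J} \big(f_j(0) - f_j(a_j^*)\big) - \sum_{j \in J' \cap \bar{J}} \big(f_j(0) - f_j(a_j^*)\big) \\
&\ge \sum_{j \in \bar{J'} \cap J} b_j - \sum_{j \in J' \cap \bar{J}} b_j \;\ge\; |\bar{J'} \cap J|\, b_{(e)} - |J' \cap \bar{J}|\, b_{(e)} \\
&= \big( (|J| - |J' \cap J|) - (|J'| - |J' \cap J|) \big)\, b_{(e)} \\
&\ge 0.
\end{aligned}
$$

Therefore, for any feasible $A'$, we have $f(A') \ge f(A^*)$. □

Based on Theorem 2, problem (3.7) can easily be solved by letting $a_{i,j} = S_{i,j}$ and $f_{i,j}(S_{i,j}) =
(S_{i,j} - E_{i,j})^2$. Specifically, the solution is

$$S_{i,j} = \begin{cases} E_{i,j} & b_{i,j} \ge b_{(e)} \\ 0 & \text{otherwise,} \end{cases} \tag{3.9}$$

where $b_{i,j} = E_{i,j}^2$ and $b_{(e)}$ is the e-th largest value in $\{b_{i,j}\}_{i=1,\ldots,M;\ j=1,\ldots,D}$. This result is very
intuitive: in each round, large errors are considered outliers, and are put into S to be excluded from
the low-rank fitting in the next round.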
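
The thresholding rule in (3.9) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function name `threshold_entries` is hypothetical, and ties at the threshold are broken arbitrarily so that at most e entries are kept.

```python
import numpy as np


def threshold_entries(E, e):
    """Keep the e entries of E with the largest magnitude and zero out the rest."""
    E = np.asarray(E, dtype=float)
    S = np.zeros(E.shape)
    e = min(int(e), E.size)
    if e <= 0:
        return S
    # Flat (row-major) indices of the e largest |E_ij|; ranking by |E_ij| is
    # equivalent to ranking by b_ij = E_ij^2.
    idx = np.argpartition(np.abs(E).ravel(), -e)[-e:]
    S.flat[idx] = E.flat[idx]
    return S
```

Using a partial sort (`np.argpartition`) instead of a full sort reflects the observation that only the e-th largest value is needed, not a complete ordering of the errors.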

The above results give the global optima of steps (a) and (b) in Algorithm 3. Each step is
therefore guaranteed not to increase the objective value while staying within the feasible region.
Since the objective is bounded below by zero, the objective value converges. In each iteration, we
do one rank-K partial SVD plus one quantile computation, so the complexity per iteration is
$O(MD(K + \log e))$. Since K and e are usually fixed and small, DRMF
can handle large-scale problems.
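
Putting steps (a) and (b) together, a minimal sketch of the whole procedure might look as follows. It is an illustration under stated assumptions, not the author's implementation: it uses NumPy's full dense SVD instead of a partial SVD routine such as PROPACK, initializes S to zero, and stops when the objective no longer decreases by more than a tolerance. The function name `drmf` and the parameters `max_iter` and `tol` are hypothetical.

```python
import numpy as np


def drmf(X, K, e, max_iter=100, tol=1e-8):
    """Illustrative DRMF loop: alternate a rank-K SVD fit and entry thresholding."""
    X = np.asarray(X, dtype=float)
    e = min(int(e), X.size)
    S = np.zeros(X.shape)              # assumed initial guess: no outliers
    prev_obj = np.inf
    for _ in range(max_iter):
        # Step (a): rank-K fit to the "clean" data C = X - S via truncated SVD.
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :K] * s[:K]) @ Vt[:K, :]
        # Step (b): move the e largest-magnitude errors into S.
        E = X - L
        S = np.zeros(X.shape)
        if e > 0:
            idx = np.argpartition(np.abs(E).ravel(), -e)[-e:]
            S.flat[idx] = E.flat[idx]
        obj = np.linalg.norm(X - S - L)  # Frobenius norm of the residual
        if prev_obj - obj <= tol:
            break
        prev_obj = obj
    return L, S, obj
```

For large matrices the dense SVD would dominate the cost; a partial SVD routine such as `scipy.sparse.linalg.svds` could be substituted in step (a). Because the first iteration starts from S = 0, the final objective is never worse than that of the plain truncated SVD of X.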

The DRMF problem (3.5) is not convex due to the constraints on the rank of L and the $L_0$-norm
of S. Therefore, the algorithm may converge to different local minima depending on the starting
point. This is reflected in the fact that the algorithm starts with an initial guess of the outliers.
However, in experiments we found that DRMF is quite stable w.r.t. the starting point, and good
initialization methods exist. More details can be found in Section 3.4.

DRMF has two parameters, K and e, that need the user's attention. Yet, their clear meanings (the
rank and the maximally allowed number of outliers) help the user select their values. In particular,
we emphasize that the value of e does not need to match the actual number of outliers. It is only
used as a safeguard to ensure that not too many data are regarded as outliers. For this purpose we
can simply set e to be, say, 5% of the whole data set. From Eq. (3.9) we can see that normal data
with small factorization errors will not be discarded as outliers. On the other hand, if there are more
than 5% outliers, the ones with the largest errors will be taken care of. We will show in Section 3.5
that this default behavior gives us good performance in various situations.


3.3  Related  Work 

Matrix factorization is widely used in data mining and machine learning, and robust subspace
analysis methods are of great value in practical situations. Many robust estimators have been proposed
(e.g. [67, 79, 86, 88, 99]). They usually involve alternative error measurements, complex
estimation procedures, or problem-specific heuristics. On the other hand, the DRMF algorithm is both
conceptually and computationally simple: it excludes some data and fits the rest, and the solution is
obtained by iteratively applying SVD and thresholding the errors.

Another limitation of traditional robust methods is that performance cannot be guaranteed in
high dimensions [44, 177]. Recently, constraining the nuclear norm [23, 111] of the matrix instead
of its rank has become a popular strategy for overcoming this problem [22, 177, 185], and has been
shown to outperform traditional algorithms. These methods can be summarized as the nuclear
norm minimization (NNM) problem. To compare, we also rewrite DRMF in one of its equivalent
Lagrangian forms, and show both problems in Table 3.1.

NNM:   $\min_{L,S} \; \|L\|_* + \lambda \|S\|_1$   s.t.  $\|X - L - S\|_F \le \sigma$
DRMF:  $\min_{L,S} \; \operatorname{rank}(L) + \lambda \|S\|_0$   s.t.  $\|X - L - S\|_F \le \sigma$

Table 3.1: Comparing the nuclear norm minimization (NNM) problem and DRMF. L is the low-rank
matrix; S is the sparse outlier matrix. $\|\cdot\|_*$ is the nuclear norm; $\sigma$ is the allowed approximation error.


We can immediately see the relationship between DRMF and the NNM methods: DRMF
minimizes the rank, while NNM minimizes the nuclear norm; DRMF measures outliers by the
$L_0$-norm, while NNM uses the $L_1$-norm. In fact, the nuclear norm and the $L_1$-norm in the NNM
problem were proposed as convex relaxations of the rank and the $L_0$-norm in the first place. In this
sense, DRMF is the “original problem” that NNM is trying to solve.

By using the relaxations, NNM is convex and its globally optimal solutions can be found.
In addition, theories have been provided for choosing $\lambda$ to guarantee the correct recovery of the
principal subspace under certain conditions [22, 177]. Yet, it is unknown how well these relaxations
approximate the original problem in general. On the other hand, the original DRMF problem is
non-convex and suffers from local minima. As a remedy, we can initialize DRMF with the
NNM results to obtain results that are better than using either NNM or DRMF alone. We expand
on this point further in Section 3.4. The theoretical properties of DRMF are difficult to analyze due to
the non-continuous and non-convex nature of the $L_0$-norm and the matrix rank. Yet we shall show
that DRMF can achieve better empirical performance than the relaxed NNM methods.

The NNM methods often set $\sigma = 0$ for exact recovery [22, 177]. Yet real-world noisy data
invalidate this choice and make the algorithm inefficient. When NNM uses $\sigma > 0$ (e.g. in
[177, 185]), it needs more assumptions to ensure theoretical soundness and introduces extra
parameters (e.g. the amount of Gaussian noise) that need careful tuning. On the other hand, DRMF
can be applied in both situations, thanks to the fact that it solves the problem in the constrained
form (3.5): the only difference between noisy and noiseless data is that the former will have
non-zero objective values.


3.4  Discussion 

3.4.1  Extensions  to  Incorporating  Prior  Knowledge 

In many situations, additional knowledge is available for us to find outliers. For example, in a
design matrix, if one sample point has been corrupted, then it is very likely that most of the entries
in its corresponding row are outliers. In collective data, we may also face the situation where if
the observation of a group is disrupted, then all of its points are affected. In this case, we should
look for outlier rows so that evidence of anomalies can be aggregated to enhance the performance.
DRMF can easily be extended to handle this situation. Here, we consider the outlier patterns to
be groups of entries that are anomalous. Instead of counting the number of outlier entries, we can
count the number of outlier patterns using structured norms such as the $L_{2,0}$-norm. Concretely, the
following DRMF-Row (DRMF-R) problem handles row outliers:

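\begin{equation}
\begin{aligned}
\min_{L,S} \quad & \|(X - S) - L\|_F \\
\text{s.t.} \quad & \operatorname{rank}(L) \le K, \\
& \|S\|_{2,0} \le e,
\end{aligned}
\tag{3.10}
\end{equation}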

where  e  is  the  maximal  number  of  outlier  rows  allowed.  DRMF-R  can  be  solved  by  replacing  step 
(b)  in  Algorithm  3  with  the  following  problem: 


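\begin{equation}
\begin{aligned}
S = \arg\min_{S} \quad & \|E - S\|_F, \qquad E = X - L \\
\text{s.t.} \quad & \|S\|_{2,0} \le e.
\end{aligned}
\tag{3.11}
\end{equation}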


Row-wise outliers have also been considered in outlier pursuit (OP) [177]. OP extends the NNM
algorithm by using the L2,1-norm to capture outlier rows. Not surprisingly, OP is the convex
relaxation  of  the  DRMF-R  problem  (3.10). 

Problem  (3.11)  can  also  be  solved  based  on  Theorem  2  by  treating  each  row  of  S  as  an  element. 
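Without giving details, we state that the solution is:

\begin{equation}
S_{i,:} =
\begin{cases}
E_{i,:} & \text{if } l_i \ge l^{(e)}, \\
0 & \text{otherwise},
\end{cases}
\tag{3.12}
\end{equation}

where $l_i = \|E_{i,:}\|_F$ and $l^{(e)}$ is the $e$th largest value among $\{l_i\}_{i=1,\dots,M}$. Again, the solution is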
obtained  efficiently  by  thresholding.  In  fact,  it  is  very  easy  to  capture  arbitrarily  shaped  outlier 
patterns  to  accommodate  specific  problems. 
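
The row-wise thresholding in (3.12) can be sketched in a few lines. This is an illustrative sketch rather than the implementation used in the experiments: the function name `row_threshold` and the list-of-rows matrix representation are assumptions made here, and ties between row norms are broken by row index so that exactly e rows are kept, whereas (3.12) read literally would keep every tied row.

```python
import math

def row_threshold(E, e):
    """Keep the e rows of E with the largest L2 norms and zero out the rest.

    E is a list of rows (lists of floats); e is the number of outlier rows allowed.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in E]
    # Indices of the e largest row norms; the secondary sort key (the row index)
    # breaks ties so that exactly e rows are kept.
    keep = set(sorted(range(len(E)), key=lambda i: (-norms[i], i))[:e])
    return [list(row) if i in keep else [0.0] * len(row) for i, row in enumerate(E)]

# Residual matrix E = X - L with one clearly corrupted row.
E = [[0.1, -0.2, 0.0],
     [5.0, 4.0, -3.0],
     [0.0, 0.3, 0.1]]
S = row_threshold(E, e=1)
print(S)
```

With e = 1 only the second row survives in S, so the next low-rank fit is computed on X - S with that row's residual removed.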

Finally,  the  low-rank  component  of  DRMF  can  also  be  extended.  For  example,  we  can  require 
the factor matrices in (3.1) to be non-negative as in non-negative matrix factorization (NMF)
[43]. To do this, we replace the constraint $\operatorname{rank}(L) \le K$ in (3.5) by the explicit factorization
form $L = UV^T$ and then impose non-negativity constraints on U and V. DRMF can also easily
be  extended  to  handle  missing  values  in  collaborative  filtering.  Fast  and  pass-efficient  algorithms 
such  as  [124]  can  also  be  integrated  into  DRMF  to  do  robust  analysis  on  massive  data  sets. 


3.4.2  Implementation 

When applying DRMF, we need to answer several important practical questions: how to choose
the parameter e (the maximal number of outliers allowed), the rank K of the factorization, and the
starting point, i.e. the initial guess of the outliers S. As discussed in Section 3.2, we can set e to be
e.g. 5% of the whole data set so that the algorithm does not ignore too much data.

Like most matrix factorization methods, in DRMF the rank of the factorization K is selected
according to prior knowledge, cross-validation, or other heuristics. For example, we can observe
the singular values of the data matrix, and choose a K that preserves a certain amount of data variability.
In  some  situations,  the  value  of  K  is  constrained  by  available  computational  resources,  so  we  have 
to  make  trade-offs  between  accuracy  and  running  time. 
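
One way to make the singular-value heuristic concrete is to pick the smallest K whose leading singular values retain a chosen fraction of the total squared singular-value mass. The sketch below assumes the singular values have already been computed elsewhere; the function name `choose_rank` and the `energy` threshold are illustrative choices, not values prescribed in this chapter.

```python
def choose_rank(singular_values, energy=0.9):
    """Smallest K whose leading singular values keep `energy` of the total squared mass."""
    squared = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(squared)
    if total == 0.0:
        return 0
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running >= energy * total:
            return k
    return len(squared)

# Hypothetical singular values with a clear gap after the third one.
sv = [10.0, 8.0, 6.0, 0.5, 0.4, 0.3]
print(choose_rank(sv, energy=0.95))
```

When computation is the bottleneck, the same function can be used with a lower `energy` to trade accuracy for a smaller K.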

The  initial  guess  of  outliers  S  affects  the  final  solution,  since  DRMF  is  non-convex  and  can  be 
trapped  in  local  minima.  For  many  moderate  situations  we  found  that  the  simple  choice  of  S  =  0 
works  well.  But  in  extreme  cases  where  the  regular  SVD  is  completely  disrupted  by  outliers,  this 
simple  heuristic  would  lead  DRMF  into  irrecoverable  local  minima.  One  such  example  is  shown 
in  Figure  3.1. 


Figure  3.1:  An  example  where  DRMF  with  initial  S  =  0  would  fail.  Blue  crosses  are  normal 
points and the red circle is the outlier. The blue arrow shows the true principal subspace and the red
dashed  arrow  shows  the  wrong  one  DRMF  would  get  starting  from  S  =  0.  Note  that  when  starting 
from  an  S  that  correctly  indicates  the  circle  as  an  outlier,  DRMF  is  able  to  achieve  the  correct  blue 
subspace. 

We found that an effective way to solve this problem is to leverage the convexity of nuclear
norm minimization (NNM) methods. Since NNM is a convex relaxation of DRMF, we can
first compute the NNM solution of S, and then use it to initialize DRMF. This strategy is similar
to the case where the linear programming relaxation is used to approximate the original integer
programming problems. In practice, we can run NNM for a few iterations and terminate before
convergence. This is usually enough to guide DRMF to a good convergence region. In this way, we
can get results that are better than using either NNM or DRMF alone. Other methods (e.g. [176])
can  also  be  used  to  initialize  DRMF.  Using  these  initialization  schemes,  DRMF  is  able  to  overcome 
the  problem  posed  in  Figure  3.1  and  get  higher  quality  results  than  NNM. 

Very  recently  we  noticed  a  parallel  work  GoDec  [183]  that  shares  the  same  idea  with  DRMF. 
By  comparison,  DRMF  extends  to  structured  outliers  as  discussed  in  Section  3.4.1.  In  addition,  the 
non-convexity  of  DRMF/GoDec  is  not  addressed  in  [183]  and  the  GoDec  algorithm  in  its  original 
form  would  likely  get  stuck  in  the  extreme  case  in  Figure  3.1. 


3.5  Experiments 

In  this  section  we  show  the  empirical  effectiveness  of  DRMF  on  both  simulation  and  real-world 
data  sets.  We  compare  DRMF  to  the  following  state-of-the-art  competitors: 

• Robust PCA (RPCA) [22] We use the code from http://perception.csl.uiuc.edu/matrix-rank.
The efficient “inexact augmented Lagrange multiplier” implementation is used.

• Stable principal component pursuit (SPCP) [185] We implemented SPCP in Matlab using
the proximal gradient method according to [52].

•  Outlier  Pursuit  (OP)  [177]  We  implemented  OP  in  Matlab  using  the  proximal  gradient 
method  according  to  [177]. 

In terms of Table 3.1, RPCA and SPCP solve the NNM problem with $\sigma = 0$ and $\sigma > 0$ respectively;
OP solves NNM when the outlier is measured by $\|S\|_{2,1}$ and $\sigma = 0$. The truncated SVD results
(3.4)  are  also  provided  as  a  baseline. 

DRMF  and  DRMF-R  are  implemented  in  Matlab.  Partial  SVD  is  done  using  PROPACK  [91]. 
We  terminate  the  iteration  when  the  relative  change  of  the  objective  value  is  diminishing. 

DRMF,  SPCP,  and  OP  are  all  initialized  by  the  solution  produced  by  10  iterations  of  RPCA. 
For  DRMF,  we  always  set  the  maximal  number  of  allowed  outliers  to  be  e  =  0.05 MD  without 
tuning  unless  indicated  otherwise. 

3.5.1  Simulation  Data 

First, we study the performances of different methods on simulated data sets. We follow the setup
in [22] to create the data matrix. Let $\mathcal{N}(\mu, \sigma^2)$ denote the Gaussian distribution with mean $\mu$ and
variance $\sigma^2$, and $U(a, b)$ denote the uniform distribution on the interval $[a, b]$. We generate the rank-K
matrix as $L = UV^T \in \mathbb{R}^{M \times M}$, where entries of the factor matrices U and V are i.i.d. samples
from Gaussian distributions as $U \in \mathbb{R}^{M \times K} \sim \mathcal{N}(0, 1/K)$, $V \in \mathbb{R}^{M \times K} \sim \mathcal{N}(0, 1/K)$. To
generate the outlier matrix S, we first select $\gamma M^2$ entries from S and then draw their values from
the uniform distribution $U(-\sigma_o, +\sigma_o)$, where $\sigma_o$ is the magnitude of outliers. Finally, we put them
together and add i.i.d. Gaussian noise for each entry to get $X = L + S + \mathcal{N}(0, \sigma_n^2)$, where $\sigma_n$ is
the level of the Gaussian observation noise.
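
The generation procedure above can be sketched with the standard library alone. This is a small illustrative version, not the Matlab code used for the reported experiments; the function name `simulate`, the seed handling, and the rounding of $\gamma M^2$ to an integer are assumptions made for this sketch.

```python
import random

def simulate(M, K, gamma, sigma_o, sigma_n, seed=0):
    """Return (X, L, S) with X = L + S + Gaussian noise, each as a list of rows."""
    rng = random.Random(seed)
    std = (1.0 / K) ** 0.5  # N(0, 1/K) has standard deviation sqrt(1/K)
    U = [[rng.gauss(0.0, std) for _ in range(K)] for _ in range(M)]
    V = [[rng.gauss(0.0, std) for _ in range(K)] for _ in range(M)]
    # L = U V^T is an M x M matrix of rank at most K.
    L = [[sum(U[i][k] * V[j][k] for k in range(K)) for j in range(M)] for i in range(M)]
    # Pick round(gamma * M^2) distinct entries and fill them with uniform outliers.
    S = [[0.0] * M for _ in range(M)]
    n_out = round(gamma * M * M)
    for idx in rng.sample(range(M * M), n_out):
        S[idx // M][idx % M] = rng.uniform(-sigma_o, sigma_o)
    # Add i.i.d. Gaussian observation noise to every entry.
    X = [[L[i][j] + S[i][j] + rng.gauss(0.0, sigma_n) for j in range(M)] for i in range(M)]
    return X, L, S

# Noiseless case (sigma_n = 0) with 5% entry outliers of magnitude 1.
X, L, S = simulate(M=40, K=2, gamma=0.05, sigma_o=1.0, sigma_n=0.0)
print(len(X), len(X[0]))
```

Setting `sigma_n` to a positive value gives the noisy setting; row outliers would instead fill whole rows of S.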

Recovery  Quality  and  Detection  Rate 

In  this  part  we  test  how  well  the  methods  can  detect  the  outliers  and  recover  the  underlying  low- 
rank  L  accurately.  We  compare  the  performances  on  three  different  indices.  To  measure  the 
accuracy  of  robust  modeling,  we  compute  the  root  mean  squared  error  (RMSE)  of  the  recovered 
L w.r.t. the true L. NNM results are “debiased” as in [110] to compensate for the shrunken singular
values. Outlier scores are computed as the absolute difference between the estimated L and the
observation X, and average precision (AP) is used to measure the detection performance. The
simulation parameters we use are $K = 0.05M$, $\gamma = 0.05$, $\sigma_o = 1$. Finally, the running time is also
compared. 
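
The two quality indices can be sketched as follows. Several definitions of average precision are in use; the version below (the mean of precision@k over the ranks at which a true outlier is retrieved) is an assumption of this sketch and may differ in detail from the one used for the reported numbers. The function names are illustrative.

```python
import math

def rmse(L_hat, L_true):
    """Root mean squared error between two equally sized matrices (lists of rows)."""
    diffs = [(a - b) ** 2 for ra, rb in zip(L_hat, L_true) for a, b in zip(ra, rb)]
    return math.sqrt(sum(diffs) / len(diffs))

def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a true outlier is retrieved."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Four entries with outlier scores |X - L_hat|; entries 0 and 3 are true outliers.
scores = [0.9, 0.1, 0.8, 0.3]
labels = [1, 0, 0, 1]
print(average_precision(scores, labels))
print(rmse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]]))
```

In this toy ranking the true outliers appear at ranks 1 and 3, so the AP is the average of 1/1 and 2/3.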

First we examine the entry outliers, noiseless observation case by selecting uniformly random
entries in S to be outliers and setting $\sigma_n = 0$. This situation satisfies the assumption made by
RPCA. We compare the performances of SVD, RPCA, and DRMF. Note that SPCP and OP cannot
be applied to this data set. For RPCA, we use parameter $\lambda = 1/\sqrt{M}$ as suggested in [22]. For
SVD and DRMF, the true K is used for factorization. Matrices with sizes M in [100, 2000]
are used. Mean performances of 20 random runs are reported in Figure 3.2. We see that both
RPCA and DRMF achieved much better performances than plain SVD, showing the necessity
and effectiveness of robust factorization. Further, even in this noiseless case, DRMF is able to
outperform RPCA consistently, using much less running time (only slightly slower than partial
SVD).

Figure 3.2: Performances on noiseless data with entry outliers. Note that the running time is shown
in log-scale.

Figure 3.3: Performances on noisy data with entry outliers.

Next we examine the entry outliers, noisy observation case. Compared to the previous simulation,
we use $\sigma_n = 0.1$ and other settings remain the same. Note that this situation violates the
assumption made by RPCA. We compare SVD, RPCA, SPCP, and DRMF here. The same settings
for SVD, RPCA, and DRMF are used as before. For SPCP, the parameter regarding the level of
regular Gaussian noise is set as suggested by [185]. Mean performances of 20 random runs are
reported in Figure 3.3. On this data set, we see DRMF achieves the best performance again. RPCA
performs poorly because of the noise, which inflates the estimated rank dramatically. SPCP, which
is essentially an extended version of RPCA to handle noisy data, shows much better accuracy here,
but is still worse than DRMF. Based on these two experiments, we conclude that DRMF can handle
both noisy and noiseless data sets, and is able to achieve better results than RPCA and SPCP.

Further  we  examine  the  row  outliers,  noisy  observation  case.  Unlike  the  entry  outlier  case, 
here we randomly select $\gamma M$ ($\gamma = 0.05$) rows in S and fill them with outliers from $U(-1, 1)$. Note
that  this  situation  violates  the  assumptions  made  by  RPCA  and  SPCP.  We  compare  SVD,  RPCA, 
SPCP, OP, DRMF, and DRMF-R here. For OP, we use parameter $\lambda = 0.4/\sqrt{M}$ as suggested by
[177]. For DRMF-R, we directly specify that there can be $\gamma M$ outlier rows. Mean performances
of 20 random runs are reported in Figure 3.4. In the presence of row outliers, SVD and SPCP
failed to work for large matrices. By contrast, OP performs poorly for small matrices, but then
catches up as M grows larger. The reason could be that OP’s suggested settings are not suitable for
small problems, where tuning the parameters by cross-validation might give better results. RPCA,
DRMF, and DRMF-R show stable performances and DRMF-R beats the others by a large margin.
This verifies that utilizing additional knowledge about outlier patterns helps robust modeling and
finding outliers. It is also interesting to see that even though DRMF is designed to handle entry
outliers, its recovery quality is not affected by row outliers in the way that SPCP’s is.

Figure 3.4: Performances on noisy data with row outliers.

Based  on  these  results,  we  conclude  that  DRMF  outperforms  the  NNM  methods  in  various 
cases,  including  noiseless  and  noisy  cases  as  well  as  different  outlier  patterns. 

Sensitivity 

In  this  section,  we  study  the  sensitivity  of  DRMF’s  performance  w.r.t.  the  magnitude  of  outliers 
and  values  of  parameters. 

First  we  examine  how  the  magnitude  of  outliers  affects  the  recovery  quality.  We  simulate 
noiseless matrices with entry outliers, using M = 400, K = 20, and $\gamma = 0.05$. Then we change
$\sigma_o$, the magnitude of outliers, from 1 to $10^5$, and calculate the RMSE between the recovered L and
the true L. Results produced by RPCA and DRMF are shown in Figure 3.5a. We can see that the recovery
quality of DRMF is not affected by the magnitude of outliers at all. This is expected: the L0-norm
used in DRMF totally disregards the magnitude of outliers and only counts the number of them.
On the other hand, though being robust, the L1-norm used in RPCA is still influenced by large
outliers,  and  we  observe  that  this  influence  grows  linearly  with  the  magnitude  of  outliers. 

We also examine how the recovery quality of DRMF is affected by the choice of the parameters
K (the rank) and e (the number, or equivalently the proportion, of allowed outliers). The matrices
are generated in the same way as the previous experiment with $\sigma_o = 1$. Then we run DRMF
with different values of K in [14, 60] and different values of e in [0, 0.2]. Recovery RMSEs are shown
in Figure 3.5b. We can see that in a large range of parameters the performance is stable. It is
especially interesting to see that moderately larger values of e actually produce better results.
This behavior verifies the role of e in DRMF: it is only a safeguard to prevent excessive data being
regarded as outliers, and it does not need to be the same as the true number of outliers. We also observe
that performance can be degraded when using values of K that are too small and values of e that
are too large (> 20%). This is expected: when e is too large, a large portion of data can be treated
as outliers (this actually violates our definition of outliers) and thus the results become unfaithful.
When K is too small, the model lacks the capability to capture the normal variability of data.

Figure 3.5: (a) Recovery RMSE of RPCA and DRMF versus the outliers’ magnitude. (b) Recovery
RMSE of DRMF versus the parameters K (the rank) and e (the proportion of allowed outliers). Darker
color indicates smaller error.

3.5.2  Video  Background  Modeling  and  Activity  Detection 

In  this  experiment  we  consider  the  problem  of  modeling  the  background  of  videos.  Estimating  the 
background  in  videos  is  important  for  many  computer  vision  tasks  such  as  activity  detection,  yet 
also difficult because of the variability of the background (e.g. due to lighting conditions) and the
presence of foreground objects such as moving people.

Here we apply robust matrix factorization methods to solve this problem. We assume that the
background variations in videos are of low rank (i.e. the background scenes can be approximated
by linear combinations of several “basis” images), and the foreground objects are sparse outliers. By
applying robust factorization methods to these video data, we expect the low-rank component
to capture the background and its variations, while the foreground activities are recognized
as outliers so that they do not interfere with the estimation of the background.

Video  sequences  “Hall”  (size  128  x  160,  frames  2100-2400),  “Lobby”  (size  144  x  176,  frames 
1300-1700),  “Restaurant”  (size  120  x  160,  frames  2500-3000),  and  “Shopping  Mall”  (size  128  x 
160,  frames  1500-2000)  from  [101]  are  used.  “Hall”  contains  a  relatively  static  background  and 
many  foreground  activities.  “Lobby”  contains  few  foreground  activities  and  large  background 
variations. “Restaurant” and “Shopping Mall” are noisier and contain many more foreground
activities. Sample images are shown in Figure 3.8.

Figure 3.6: Video activity detection performance.


We flatten and stack the video frames into a matrix, with each row corresponding to a frame. Then
we use SVD, RPCA, SPCP, and DRMF to estimate the background. The anomaly scores of pixels
are computed as the absolute difference between the estimated background and the observation,
so our hope is that pixels corresponding to foreground activities will receive high scores. The
performance is measured by the average precision of detecting foreground pixels on the ground
truth frames. We use the suggested parameters for RPCA and SPCP (for SPCP, the median of the
pixels’ standard deviations is used to estimate the Gaussian noise level). For SVD and DRMF, rank-5
models are used for “Hall” and “Lobby”, and rank-7 models are used for “Restaurant” and “Shopping
Mall” to capture the background variations.
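
The data preparation and scoring steps can be sketched as below. The low-rank background estimate is replaced by a hand-written stand-in, since the factorization itself is outside the scope of this sketch; the function names and the tiny 2 x 2 "frames" are purely illustrative.

```python
def frames_to_matrix(frames):
    """Flatten each H x W frame (a list of pixel rows) into one row of the data matrix."""
    return [[pixel for row in frame for pixel in row] for frame in frames]

def anomaly_scores(X, L_hat):
    """Entry-wise absolute difference between the observation and the estimated background."""
    return [[abs(x - l) for x, l in zip(rx, rl)] for rx, rl in zip(X, L_hat)]

# Two tiny frames; the second contains one bright foreground pixel.
frames = [[[10, 10], [10, 10]],
          [[10, 90], [10, 10]]]
X = frames_to_matrix(frames)
background = [[10, 10, 10, 10], [10, 10, 10, 10]]  # stand-in for a low-rank estimate
scores = anomaly_scores(X, background)
print(scores)
```

Only the foreground pixel receives a nonzero score here; in practice the scores of the ground-truth frames are ranked and evaluated by average precision.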

Detection results on some ground-truth frames using DRMF and RPCA are shown in Figure
3.8. Both methods are able to separate the foreground and background and produce good results.
On closer examination, we can see that the background images captured by DRMF are
smoother and contain fewer artifacts than those of RPCA. Figure 3.6 shows the detection performance
and running time of different methods. Again, we see that DRMF consistently gives better detection
performance than RPCA and SPCP.

3.5.3  Hand-written  Digit  Modeling 

In  the  last  experiment,  we  use  these  factorization  methods  to  find  anomalous  digit  images.  The 
assumption is that images of the same digit have a low-rank structure (i.e. these images reside in a
low-dimensional subspace), and if we inject a small number of different digits, these injections
will  violate  the  low-rank  structure  and  stand  out  as  outliers. 

We  use  digits  ‘1’  and  ‘7’  from  the  USPS  data  set  as  in  [177].  The  image  size  is  16  x  16.  We 
select  a  data  set  that  is  a  mixture  of  220  images  of  ‘1’  and  11  images  of  ‘7’.  The  goal  is  to  detect  all 
the  ‘7’s  in  an  unsupervised  way.  To  do  this,  we  flatten  all  images  as  row  vectors  and  stack  them  into 
a  231  x  256  matrix  X.  Then,  factorization  methods  are  applied  to  estimate  low-rank  matrices  L 
which are expected to capture the ‘1’s. Finally, each image (a row of X) is scored by the L2-norm
of its corresponding row in the error matrix X − L. Ideally, ‘7’s should receive higher scores than
‘1’s.

We compare SVD, RPCA, SPCP, DRMF, OP, and DRMF-R on this task. For the SVD and DRMF
methods, rank K = 3 is used. For NNM methods, suggested parameters are used as before.
Performances are measured by the average precision of detecting ‘7’s. In each run, we randomly
re-select the images. Results of 20 random runs are shown in Figure 3.7a.


Figure 3.7: USPS anomaly detection results. (a) The average precisions of detecting ‘7’s among
‘1’s. (b) Images ranked by their anomaly scores in descending order.

We  can  see  that  DRMF-R  gives  the  best  results,  showing  the  advantage  of  our  direct  solution, 
and  the  benefit  from  incorporating  knowledge  about  the  outliers’  structure.  On  the  other  hand, 
RPCA  and  SPCP  failed  in  this  case,  since  the  non-uniform  and  non-random  outliers  in  this  data  set 
violate  their  basic  assumptions.  The  difference  between  OP  and  DRMF-R  is  significant:  a  paired 
t-test gives a p-value of $0.95 \times 10^{-6}$. Figure 3.7b shows a list of images ranked by their anomaly
scores. We conclude that the ‘1’s are clearly captured by the low-rank structure in DRMF, and it is
interesting  to  observe  the  behavior  of  the  (1,  2)th  and  the  (2,  5)th  image. 

3.6  Summary 

We  proposed  the  direct  robust  matrix  factorization  (DRMF)  algorithm  as  a  simple  and  effective 
way  for  robust  low-rank  factorizations  and  outlier  detection.  We  start  from  the  fundamental  notion 
of  outliers  and  use  a  direct  formulation  to  address  this  problem.  DRMF  is  conceptually  simple 
(SVD  +  error  thresholding),  easy  to  implement  (about  10  lines  of  Matlab  code),  efficient  (linear 
complexity  w.r.t.  number  of  entries),  and  flexible  to  incorporate  prior  knowledge  about  both  the 
outliers  and  the  low-rank  structure. 

DRMF is compared to the recently proposed nuclear norm minimization (NNM) family of methods.
We show that NNM methods are in fact convex relaxations of DRMF. In extensive empirical
evaluations we find that the solutions given by DRMF achieve better performances than the
state-of-the-art competitors that use relaxations, showing the advantage of our direct formulation.

The factorization algorithms proposed in this chapter and Chapter 2 are widely useful when
learning from discrete collective data as well as other vector- or matrix-based data formats.

3.7  Automatic  Novelty  Discovery  for  Astronomy 

Using the factorization techniques, we have developed an automatic system for real-time novelty
discovery in astronomical survey data described in Section 1.5. From the ongoing SDSS III
project², we can get daily updates of the new objects observed by the telescope. The goal is to
develop a system that can examine these new objects in real time, and automatically pick out the
potentially interesting ones to present to the astronomers for further examination. Then we
collect feedback from the astronomers to support further studies.

One  of  the  goals  of  this  system  is  to  find  objects  with  unusual  spectra.  For  this  purpose  we 
detect  subspace  outliers  as  mentioned  in  Section  3.1.  The  assumption  behind  this  choice  is  that 
normal  spectra  lie  in  a  low-dimensional  linear  subspace.  In  other  words,  we  can  find  a  small 
number  of  bases  whose  linear  combinations  can  approximate  normal  spectra.  On  the  contrary, 
anomalous  spectra  contain  unusual  spectral  patterns  that  cannot  be  reconstructed  by  these  bases. 
For example, if a spectrum has an unusual emission line that the normal bases do not have, then
this spectrum cannot be approximated by the bases and thus will be detected. Many subspace
approaches have been proposed to study spectra in astronomy. [37] provides a brief survey of
studies that used PCA to accomplish tasks including spectra classification, visualization, and
physical property extraction. [113, 178, 179] analyzed the quasars, galaxies, and stars in SDSS
using PCA. [33] uses PCA to repair corrupted pixels in spectra. These studies indicate that
low-dimensional subspaces can indeed capture the main characteristics of the SDSS spectra.

Subspace  outliers  can  be  detected  by  first  modeling  the  normal  subspace  and  then  finding  points 
outside of it. We can use factorization algorithms to accomplish this goal. In this process, robustness
is a desirable property of the low-rank model because of the severe challenges presented by
outliers,  especially  for  our  automatic  detection  purpose.  First  of  all,  the  survey  observations  usually 
have  limited  quality.  There  are  plenty  of  bad  pixels,  interference  from  the  sky,  and  other  sources  of 
noise  such  as  galaxies  mislabeled  as  stars.  Second,  the  emission  lines  (Figure  1.1b)  in  the  spectra 
may  vary  dramatically  and  cause  problems  to  regular  algorithms.  As  analyzed  by  [163],  emission 
features  form  a  main  source  of  the  inadequacy  of  linear  subspaces  given  by  PCA.  Finally  and  most 
importantly,  in  our  automated  pipeline,  the  potentially  novel  objects  are  hidden  among  all  the  other 
regular  ones.  These  novel  objects  are  usually  also  outliers  that  can  bend  the  model  towards  their 
side and thus make themselves harder to detect. Therefore, the low-rank models need to be robust,
so that they remain reliable when trained from a set of “dirty” spectra including potential
novelties,  bad  pixels,  large  emission  features,  and  other  corruptions. 

To  answer  these  challenges,  we  use  the  DRMF  algorithm  to  find  robust  normal  subspaces. 
DRMF  is  able  to  alleviate  the  impact  of  emission  lines  in  the  spectra  and  obtain  reliable  subspace 
models. It is also efficient enough to process the large amount of data. We can use the existing data
to learn a reliable subspace that contains most of the normal spectra. Once we have this subspace
model,  anomalous  spectra  can  be  found  outside  of  this  subspace. 

² http://www.sdss3.org

In order to find the anomalies, we propose to use the following scoring function:

\begin{equation}
f_p(x_i) = \|x_i - \hat{x}_i\|_p = \left( \sum_{j=1}^{D} |x_{ij} - \hat{x}_{ij}|^p \right)^{1/p},
\tag{3.13}
\end{equation}

where $\hat{x}_i$ is the projection of $x_i$ onto the normal subspace, or equivalently the low-rank reconstruction
of $x_i$. Therefore, $f_p(\cdot)$ calculates the Lp distance from a point to its projection in the normal
subspace. The parameter p controls whether the scoring function should focus on a few badly
reconstructed pixels or the mismatch of the overall shape of the spectrum. To apply the scoring
function (3.13) with DRMF, we just have to train a DRMF model on the data and obtain the robust
low-rank approximation L. Then, we can use the $i$th row of L as $\hat{x}_i$.
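
The effect of p in the scoring function (3.13) can be seen on a toy example. The sketch below is illustrative only: the name `lp_score` and the two synthetic "spectra" are assumptions, and the largest residual is factored out before exponentiation purely as a numerical convenience for large p.

```python
def lp_score(x, x_hat, p):
    """L_p distance between a spectrum x and its low-rank reconstruction x_hat."""
    residual = [abs(a - b) for a, b in zip(x, x_hat)]
    peak = max(residual)
    if peak == 0.0:
        return 0.0
    # Factor out the largest residual so that large p does not underflow.
    return peak * sum((r / peak) ** p for r in residual) ** (1.0 / p)

x_hat = [1.0, 1.0, 1.0, 1.0]
spike = [1.0, 1.0, 1.0, 2.0]   # one badly reconstructed pixel
drift = [1.5, 1.5, 1.5, 1.5]   # a moderate error on every pixel
for p in (1, 2, 10):
    print(p, lp_score(spike, x_hat, p), lp_score(drift, x_hat, p))
```

With p = 1 the spectrum with a small error on every pixel scores higher, while with p = 10 the single badly reconstructed pixel dominates, which is the behavior described above.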

3.7.1  Results  on  SIMBAD  Objects 

In order to test the performance of our proposed detector, we looked up the SIMBAD³ database to
find stars that have been assigned class labels. We test the detector’s ability to find objects that are
labeled by SIMBAD. Although these SIMBAD-labeled objects are only a small portion of the whole
data set, and might not correspond precisely to the true novelties in the whole data set, we
trust that on average these labeled objects are more interesting than the rest of the data set, and we
would want the detector to find them.

Our data set contains 49,529 stars from SDSS. These stars are selected to have a high enough
signal-to-noise ratio. We normalize the stars’ spectra so that each spectrum vector sums to 1;
therefore only the shape of a spectrum matters. We found that 6,454 of these stars have been
assigned a label from one of the 42 SIMBAD classes, which are listed at
http://simbad.u-strasbg.fr/guide/chF.htx. For ease of presentation and analysis, we collapse
these 42 classes into 15 according to the class hierarchy specified by SIMBAD.

We compare the performance of DRMF to PCA and RPCA as in Section 3.5. The parameters
of different algorithms are hand tuned to achieve their respective optima. We found that the
rank-5 DRMF combined with scoring function $f_{10}(\cdot)$ produces the best results. This indicates that
the spectra indeed have a low rank, and the scoring function should focus on a few erroneously
reconstructed  pixels  instead  of  counting  small  errors  on  all  pixels. 

Table 3.2 shows the labeled classes and the APs of different methods on detecting these
classes. We can see that the DRMF-based detector achieved the best result, and the RPCA-based
detector is only slightly worse. Both robust methods are significantly better than plain PCA,
showing the necessity of robust modeling. The achieved AP of 56.72 also shows that the proposed
detection scheme is very effective in finding interesting objects. Concretely, about 80 of the top
100, or 3000 of the top 5000, detection results are interesting enough to be labeled by SIMBAD.

3.7.2  Collaborative  System 

Anomaly  detection  is  just  the  first  step  in  analyzing  the  astronomical  data.  It  is  vital  for  learning 
systems  to  get  supervision  from  experts  in  the  forms  of  class  labels,  etc.  Therefore,  we  present 
the detection results to the astronomers and let them provide feedback. After the astronomers
have  examined  and  labeled  these  anomalies,  we  shall  have  the  “seed”  label  information  to  support 
further  learning  tasks  such  as  active  learning  and  classification. 

To facilitate this process, we developed a real-time detection system as well as a website to
detect anomalies from the real-time data provided by SDSS-III, present the detection results,
and collect feedback. The backend system receives new data from the SDSS-III data source
daily. It then updates the robust low-rank model and detection results to reflect the recent changes.
Various features and detection methods are implemented for diversity and evaluation purposes.

³ http://simbad.u-strasbg.fr/simbad

Class   Description            Class Size   PCA     RPCA    DRMF
All     All labeled objects    5611         31.01   54.39   56.72
?       Unknown object         46           0.80    1.11    1.09
IR      Infra-Red source       35           0.07    0.06    0.06
LM*     Low-Mass               71           2.96    0.89    0.69
PM*     High proper-motion     85           1.34    1.38    1.33
UV      UV-emission source     44           0.11    1.18    1.18
blu     Blue object            140          0.50    1.54    1.73
CLU     Cluster                45           0.60    0.96    0.90
G       Galaxy                 316          7.18    9.96    9.83
CV      Cataclysmic Variable   96           21.04   27.46   26.50
PEC     Peculiar star          800          2.94    9.70    9.52
NEB     Nebula                 24           59.25   75.79   75.66
V       Variable star          172          0.59    0.54    0.70
WD      White dwarf            3617         25.69   52.48   56.46
RAD     Radio source           105          1.27    1.17    1.13
X       X-ray source           15           0.07    0.34    0.32

Table 3.2: AP of detecting SIMBAD objects using the normalized spectrum feature.

A snapshot of the website⁴ is shown in Figure 3.9. The goal of the website is to ease communication
and collaboration. Through it, we can inform the astronomers of the latest detection results
and they can give us feedback on what these results mean and how good they are. Users are
able to select detection results based on different object types, features, and detection algorithms.
Each detected object has a block showing the essential information and useful links for easy labeling.
Comments and feedback can be collected from multiple users to enable discussion and
collaboration. We also provide other functionalities such as finding look-alike objects based on
spectrum similarity, as shown in Figure 3.10. In the future, this website can be enhanced to facilitate
other learning tasks such as classification, clustering, finding group anomalies (see Chapter
4), and so on.


4 Currently hosted at http://www.autonlab.org/sdss

Figure 3.8: Video activity detection result frames for (a) Hall, (b) Lobby, (c) Restaurant, and (d) Shopping Mall. In each sub-figure, the images from left to right are: the original frame (INPUT), background and foreground from DRMF (BG-DRMF, FG-DRMF), and background and foreground from RPCA (BG-RPCA, FG-RPCA).

Figure 3.9: The front page of the collaborative SDSS website.

Figure 3.10: The UI for finding similar objects in the collaborative SDSS website.

Part II

Learning from Multidimensional Data

Chapter 4

Generative Models for Collective Data


From now on, we consider groups of real-valued, multi-dimensional points. For these groups, there
is no easy way to reduce them to vectorial representations. In this chapter, we describe parametric
generative models that directly capture the generating process of the groups. These models can
then be used for classification, clustering, anomaly detection, and so on. In the following,
however, our models will be motivated by the group anomaly detection problem.


4.1  Introduction 

Given a data set, anomaly/novelty detection aims at finding things that “surprise” us. These things
can either interfere with the learning process, in which case they should be removed, or they
may be valuable precisely because they are novel. Section 3.1 gave a brief introduction to anomaly
detection. Traditional anomaly detection typically focuses on finding individual point anomalies.
But often the most interesting or unusual things in a data set are not odd individual points, but
rather larger-scale phenomena that only become apparent when groups of points are considered.
We call these unusual groups group anomalies.

Group anomalies exist in many real-world problems. For example, as mentioned in Chapter 1,
astronomy surveys such as the Sloan Digital Sky Survey (SDSS) produce descriptions for a vast
number of celestial objects. We not only want to pick out the scientifically valuable objects like
planetary nebulae, but also special clusters of galaxies that could shed light on the development of
the universe [166]. Also, in particle simulation systems in physics, a single particle is seldom
interesting, but a group of particles can exhibit interesting motion patterns such as interweaving
vortices. In computer vision, an unusual image can be a strange group of local patches, and an
anomalous video sequence can be thought of as an odd group of image frames. Other examples
are abundant in the fields of text processing, time series, and spatial data analysis.

Two types of group anomalies are considered. A point-based group anomaly is a group of
individually anomalous points. A distribution-based group anomaly is a group where the points are
relatively normal, but as a whole they are unusual. Most existing work on group anomaly detection
focuses on point-based anomalies. A common way to detect point-based anomalies is to first
identify anomalous points and then find their aggregations using scanning or segmentation methods
[38, 39, 68]. This paradigm clearly does not work well for distribution-based anomalies, where the
individual points are normal. To handle distribution-based anomalies, we can design features for
groups and then treat them as points [25, 81]. However, this approach relies on feature engineering
that is domain specific and can be difficult.

We take a generative approach to address the group anomaly detection problem. If we have a
probabilistic model that generates the normal data, then we can mark the groups that have small
probabilities under this model as anomalies. The “bag-of-points” assumption is made, i.e., points
in the same group are unordered and infinitely exchangeable. Under this assumption, mixture
models are often used to model the data due to De Finetti's theorem [40]. The most famous class
of generative models for modeling group data is the family of topic models [14, 72]. In topic
models, distributions of points in different groups are mixtures of simple components called
“topics”, which are shared among all the groups.

We propose the genre models based on topic models. Genre models are specifically designed
for detailed characterization of the groups and for detecting distribution-based group
anomalies. A flexible probabilistic structure based on a mixture of “genres” is employed to
describe how the topic weights are generated for each group, so that complex normal behaviors can be
modeled. These genres capture the high-level distributional behavior of the groups, and therefore
are ideal for detecting distribution-based anomalies.

To achieve more precise modeling of the groups, we further allow each group to have its own
topics, which accommodates the variations of the points' distributions in different groups.
Meanwhile, information is still shared among groups via a global mechanism called the “topic
generators”, which helps estimate the topics. Topic generators can also capture the behavior of
the topics and detect the presence of unusual topics that cause point-based anomalies.

Having the genre models, we can examine whether a test group conforms to the normal behavior
defined by the learned genres and topic generators. We show that straightforward scoring
functions have their limitations and may lead to unstable results. Instead, several specifically
designed scoring functions are used to detect both point-based and distribution-based group
anomalies. These scoring functions are also developed to be robust against noise, and to ameliorate
the weaknesses of the simpler models so that a good balance between speed and flexibility is achieved.

Exact inference and learning for the genre models are generally intractable, so we resort to
approximate methods. Both variational inference and Gibbs sampling [58] methods are developed
to learn the genre models. We test the performance of the genre models along with the scoring
functions on both synthetic and real-world data sets, including scene images, astronomical
surveys, and turbulence simulations. Empirical results show that the proposed methods are effective
in modeling collective data and finding group anomalies.

The chapter is structured as follows. We introduce some background and define the problem
set-up in Section 4.2. In Section 4.3 we describe some related work. The proposed models and
scoring functions are described in Sections 4.4, 4.5, and 4.6. In Section 4.7, we discuss
the models and the scoring functions in detail. Experimental results are shown in
Section 4.8. We finish with a short discussion and conclusions (Section 4.9).

4.2 Background


In  this  section,  we  provide  background  about  topic  models  and  define  our  group  anomaly  detection 
problem.  For  intuition,  we  introduce  the  problem  in  the  context  of  detecting  anomalous  images, 
rare  galaxy  clusters,  and  unusual  motion  in  a  dynamic  fluid  simulation,  but  the  methods  can  be 
used  for  other  collective  data. 

We consider a data set with $M$ groups $G_1, \ldots, G_M$ (e.g. spatial clusters of galaxies, patches in
an image, or fluid motions in a local region). Group $G_m$ contains $N_m$ points (galaxies, image
patches, simulation grid points) as $G_m = \{x_{m,1}, \ldots, x_{m,N_m}\}$, $x_{mn} \in \mathbb{R}^D$, where $D$ is the
dimensionality of the points. We further assume that points in the same group are unordered and exchangeable.

Topic models such as latent Dirichlet allocation (LDA) [14] are widely used to model data
having this kind of group structure. The original LDA model was proposed for text processing. It
represents the distribution of points (words) in a group (document) as a mixture of $K$ global topics
$p(x; \beta_1), \ldots, p(x; \beta_K)$, where $\beta_k$ is the parameter of the $k$th topic. When the points are discrete
(e.g. words in text documents), $p(x; \beta_k)$ can be the multinomial distribution $\mathcal{M}(\beta_k)$ with $\beta_k \in \mathbb{S}^D$,
where $\mathbb{S}^D$ is the $D$-dimensional probability simplex. Let $\mathcal{M}(\theta)$ be the multinomial distribution
parameterized by $\theta \in \mathbb{S}^K$ and $\mathcal{D}ir(\alpha)$ be the prior Dirichlet distribution with parameter $\alpha \in \mathbb{R}_{+}^{K}$.

LDA generates the $m$th group by first drawing its topic weight $\theta_m$ from the prior distribution
$\mathcal{D}ir(\alpha)$. Then for each point $x_{mn}$ it draws one of the $K$ topics from $\mathcal{M}(\theta_m)$ (i.e., $z_{mn} \sim \mathcal{M}(\theta_m)$)
and then generates the point according to this topic ($x_{mn} \sim p(x; \beta_{z_{mn}})$). A description of the LDA
model can also be found in Figure 1. Figure 4.1 shows the graphical model of LDA.
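
The LDA generative process described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names `sample_dirichlet` and `sample_lda_group`, the toy topics, and the parameter values are assumptions made for this example and are not taken from the thesis.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_lda_group(alpha, topics, n_points, rng):
    """Generate one group under LDA with discrete (multinomial) topics.

    alpha  : Dirichlet parameter, one positive entry per topic
    topics : list of K probability vectors over D symbols (beta_1..beta_K)
    """
    theta = sample_dirichlet(alpha, rng)                 # theta_m ~ Dir(alpha)
    topic_ids = range(len(topics))
    z, x = [], []
    for _ in range(n_points):
        k = rng.choices(topic_ids, weights=theta)[0]     # z_mn ~ M(theta_m)
        symbols = range(len(topics[k]))
        x.append(rng.choices(symbols, weights=topics[k])[0])  # x_mn ~ M(beta_k)
        z.append(k)
    return theta, z, x

rng = random.Random(0)
toy_topics = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]   # two hypothetical topics over three symbols
theta, z, x = sample_lda_group([0.5, 0.5], toy_topics, 20, rng)
```

The returned `theta` is the group's topic weight, and `z` and `x` hold the per-point topic indicators and generated symbols.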


Figure  4.1:  Graphical  model  for  latent  Dirichlet  allocation  (LDA). 

Essentially, topic models capture each group by a mixture model. The key to successful modeling
is that the topics (mixture components) $\{p(x; \beta_k)\}_{k=1,\ldots,K}$ are shared among all groups, and
therefore all the information is used to learn them. The shared topics also provide a common basis
for comparing the groups. Another important ingredient is that the mixing weights are governed by
the global distribution $p(\theta; \alpha) = \mathcal{D}ir(\alpha)$, which is used to convey prior information to help the
estimation of the topics and the topic weights in each group.

Topic models can also help us model groups of real-valued, multidimensional points with a
slight change. In our examples, we want the topics to represent concepts such as the galaxy types
(e.g. “blue”, “red”, or “emissive”, with $K = 3$ topics), objects in the images, or common motion
patterns in the fluid (go left, go right, etc.), each of which can be captured by a distribution of the
points. To do this, we can choose to model the topics by Gaussian distributions (i.e. $p(x; \beta_k) =
\mathcal{N}(\beta_k)$, where $\beta_k$ contains both the mean vector and the covariance matrix), so that each point is
generated from one of the $K$ Gaussian distributions.

Now we ask whether the distribution of points in group $G_m$ is normal. At a higher
level, a group is characterized by the topic weight $\theta_m$, i.e., the proportion of different topics in
the group $G_m$. At a lower level, we should also look at how the actual points are generated from
the topics. This two-level characterization can help us define the group anomalies: a point-based
group anomaly contains points that are unlikely to be from any of the topics, and a
distribution-based group anomaly has a topic weight $\theta_m$ that is anomalous. For example, when
detecting group anomalies in the astronomy data, we are looking for galaxy clusters containing
galaxies that do not fall into the common types (red, blue, and emissive), or clusters in which the
proportion of different types of galaxy is strange.

Although topic models are very useful in estimating the topics and topic weights in the groups,
the existing methods are incapable of detecting group anomalies comprehensively. In order to
detect anomalies, the model should be flexible enough to capture complex normal behaviors. For
example, it should be able to model complex and multi-modal distributions of the topic weight
$\theta$. LDA, however, only uses a single Dirichlet distribution to generate topic weights, and cannot
define with precision what is normal and what is not. It also uses the same $K$ topics for all
groups, which might make groups indistinguishable when looking at their topics. Moreover, these
shared topics are not adapted to each group either.

The genre models and their corresponding scoring functions are developed to address these
problems. Based on latent Dirichlet allocation (LDA) [14], we progressively propose three
probabilistic hierarchical models designed specifically for the purpose of detecting group anomalies. The
first model is simple and focuses on distribution-based anomalies; it enriches LDA by allowing a
flexible way of generating topic weights. The second model further takes both distribution-based
and point-based anomalies into account, forming a very elastic topic model and a comprehensive
anomaly detector. The third one finds a trade-off between model flexibility and learning speed to
make the most practical model.


4.3  Related  Work 

Typically, the notion of “anomaly” depends heavily on the specific problem, and various algorithms
have been developed for their own purposes. Quite often they are based only on the simple idea
that a data point is anomalous if it falls in a low-density region of the feature space. For example,
[182] uses the distances to nearest neighbors as an anomaly score. [21] considers the case of
non-uniform density of the normal data, and proposes a local outlier factor for detecting anomalous
instances. We can also explicitly estimate the underlying density function and use statistical tests
to find anomalies. For a comprehensive summary, readers can refer to the survey by [26].

Detecting group anomalies is not a new problem, but only a few results have been published
on it. One idea is to represent each group as a point, and then apply point anomaly detectors to
these groups. To do this, we need to define a feature vector for the groups [25, 81]. A problem
with this approach is that it relies heavily on feature engineering, which can be domain specific
and difficult. We believe that directly modeling the generative process of the data is more natural,
and can help us explore the data sets.

Another approach is to first identify the individual anomalous points, and then try to find
aggregations of these points. Scan and segmentation methods are often used for this purpose. On image
data, [68] applied a point anomaly detector to find anomalous pixels, and then segmented the image
to find the anomalous groups of pixels. [38] first detects interesting points, and then finds subsets
of the data with a high ratio of anomalous points. [39] proposed a scan statistic-based method to
find anomalous subsets of points. In these approaches the anomalousness of a group is determined
by the anomalousness of its member points; therefore they cannot find groups that are
unusual only at the group level, i.e. the distribution-based anomalies.

The proposed genre models belong to the family of topic models. The goal of traditional topic
models is to estimate the topics and the topic weights in each group, and the model parameters are
used to facilitate the estimations. They lack the ability to do group anomaly detection, which
requires models that capture the details of how the groups are generated so that unusual behaviors
can be distinguished from normal ones. Many enhanced topic models have been proposed to
increase the modeling power. [13] and [102] enhanced the prior distribution of the topic weights to
model the correlations between topics. [80] and [50] use mixture models to generate topic weights
in order to cluster the groups. These ideas are useful for modeling group-level behaviors but fail
to capture anomalous point behaviors. On the other hand, [45] proposed to use different topics
for different groups in order to account for the burstiness of the points. These adaptive topics are
useful in recognizing point-level anomalies, but cannot be used to detect anomalous behavior at the
group level. Finally, it has been unclear how topic modeling should be used to find group anomalies.
To address the above problems, we use ingredients from topic modeling research and propose new
models to characterize groups at both the group level and the point level. Corresponding scoring
functions are also designed to be sensitive to anomalies but robust against insignificant noise.
We demonstrate that they are able to solve the issues above and perform better than the existing
algorithms.


4.4  Multinomial  Genre  Models 

We extend LDA to address several of its weaknesses in detailed modeling of complex collective
data. To address the problem of the simplistic distribution of topic weights in LDA, we introduce
the concept of “genres” to characterize the topic weights so that complex normal behaviors can be
recognized. A genre, intuitively, is a type of typical/normal topic weight, and we allow a group
to derive its topic weight from one of the many genres. The combination of these genres is very
flexible and thus can accurately describe what normal topic weights should be like in the data set.
In the next sections, the assumption that topics are shared globally in topic models will also be
relaxed to further enhance the modeling power.

To  start,  we  let  the  genres  be  the  typical  topic  weights  themselves.  In  other  words,  we  construct 
a  dictionary  of  typical  topic  weights  (i.e.  multinomial  distributions),  and  each  group  can  select  one 
of  them  as  its  own  topic  weight.  We  call  this  model  the  Multinomial  Genre  Model  (MGM). 

We assume that there are $K$ topics $\{p(x; \beta_k)\}_{k=1,\ldots,K}$, and the points are generated from one of
the $K$ Gaussian topics $p(x; \beta_k) = \mathcal{N}(\mu_k, \Sigma_k)$, where $\beta_k = \{\mu_k, \Sigma_k\}$ contains the mean and covariance
of the Gaussian. We shall nevertheless keep the general notation so that other types of topics are covered. Also
let the $t$th genre be $\alpha_t \in \mathbb{S}^K$, denoting a typical topic weight vector, let $\alpha = \{\alpha_t\}_{t=1,\ldots,T}$
denote the set of $T$ genres, and let $p \in \mathbb{S}^T$ be a distribution over the genres (weights of the genres). The
generative process of MGM is described in Algorithm 4, and the corresponding graphical model
is shown in Figure 4.2.


Algorithm 4 The generative process of MGM.
for groups m = 1 to M do
  • Choose a genre $y_m \in \{1, \ldots, T\}$, $y_m \sim \mathcal{M}(p)$. Let the topic weight $\theta_m = \alpha_{y_m} \in \mathbb{S}^K$.
  for n = 1 to $N_m$ do
    • Choose a topic $z_{mn} \in \{1, \ldots, K\}$, $z_{mn} \sim \mathcal{M}(\theta_m)$.
    • Generate a point $x_{mn} \in \mathbb{R}^D$, $x_{mn} \sim P(x_{mn} \mid \beta, z_{mn})$.
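
As an illustration, Algorithm 4 can be sketched in Python as follows. The sketch assumes Gaussian topics with diagonal covariance (a simplification of the full-covariance topics in the text), and the function name `sample_mgm_group` and its argument names are hypothetical.

```python
import random

def sample_mgm_group(genre_weights, genres, topic_means, topic_stds, n_points, rng):
    """Generate one group following the generative process of MGM.

    genre_weights : distribution over the T genres
    genres        : T topic-weight vectors (alpha_1..alpha_T), each of length K
    topic_means   : K mean vectors of length D
    topic_stds    : K vectors of per-dimension standard deviations
                    (diagonal covariance, an assumption of this sketch)
    """
    y = rng.choices(range(len(genres)), weights=genre_weights)[0]   # y_m ~ M(p)
    theta = genres[y]                                               # theta_m = alpha_{y_m}
    zs, xs = [], []
    for _ in range(n_points):
        k = rng.choices(range(len(theta)), weights=theta)[0]        # z_mn ~ M(theta_m)
        point = [rng.gauss(mu, sd)                                  # x_mn ~ N(mu_k, diag(sd^2))
                 for mu, sd in zip(topic_means[k], topic_stds[k])]
        zs.append(k)
        xs.append(point)
    return y, zs, xs
```

Because the topic weight is copied directly from the chosen genre, all groups of the same genre share exactly the same mixing proportions; only the sampled points differ.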


Our strategy for group anomaly detection is as follows. Using the training set, we first learn
the model parameters $\Theta = \{p, \alpha, \beta\}$. If a test group $G$ is not compatible with our model, then it
will lead to a small likelihood $P(G \mid \Theta)$ compared to normal groups like those in the training data.
Hence we can detect it as an anomalous group.

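Under MGM, the complete and marginal likelihood of group $G_m$ are

$$P(y_m, z_m, G_m \mid \Theta) = p_{y_m} \prod_{n=1}^{N_m} \alpha_{y_m, z_{mn}}\, P(x_{mn} \mid z_{mn}, \beta), \qquad (4.1)$$

$$P(G_m \mid \Theta) = \sum_{t=1}^{T} p_t \prod_{n=1}^{N_m} \sum_{k=1}^{K} \alpha_{tk}\, P(x_{mn} \mid \beta_k). \qquad (4.2)$$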
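
The marginal likelihood (4.2) is a sum of products of many small numbers, so a direct evaluation underflows quickly. A minimal sketch of a numerically stable evaluation in log space is given below; it assumes diagonal-covariance Gaussian topics, and the helper names (`log_gauss_diag`, `logsumexp`, `mgm_log_likelihood`) are illustrative rather than part of the original method.

```python
import math

def log_gauss_diag(x, mean, std):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((v - m) / s) ** 2
               for v, m, s in zip(x, mean, std))

def logsumexp(values):
    """log(sum(exp(v))) computed without overflow or underflow."""
    top = max(values)
    if top == float("-inf"):
        return top
    return top + math.log(sum(math.exp(v - top) for v in values))

def mgm_log_likelihood(group, genre_weights, genres, topic_means, topic_stds):
    """log P(G_m | Theta) = log sum_t p_t prod_n sum_k alpha_tk P(x_mn | beta_k)."""
    n_topics = len(topic_means)
    # log P(x_mn | beta_k) for every point n and topic k
    log_px = [[log_gauss_diag(x, topic_means[k], topic_stds[k]) for k in range(n_topics)]
              for x in group]
    per_genre = []
    for p_t, alpha_t in zip(genre_weights, genres):
        if p_t <= 0.0:
            continue                      # genres with zero weight contribute nothing
        total = math.log(p_t)
        for row in log_px:
            total += logsumexp([math.log(a) + lp
                                for a, lp in zip(alpha_t, row) if a > 0.0])
        per_genre.append(total)
    return logsumexp(per_genre)
```

A group whose topic proportions match one of the genres receives a higher log-likelihood than a group with the same points per topic but proportions that match no genre.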

To learn the parameters using maximum likelihood estimation, we want

$$\hat{\Theta} = \arg\max_{p, \alpha, \beta} \sum_{m=1}^{M} \log P(G_m \mid p, \alpha, \beta).$$

Unlike the LDA model, direct maximization of the likelihood function is possible, and we can use the
Expectation-Maximization (EM) method. However, in practice we found that the variational EM
[77] method was able to learn high-quality models faster than EM. Therefore, in the following we
shall describe the variational method.

4.4.1  Inference  and  Learning 

According to Jensen's inequality, for any variational distribution $q_m(y, z)$ we have that

$$\log P(G_m \mid \Theta) \ge \int d(y, z)\, q_m(y, z) \log \frac{P(y, z, G_m \mid \Theta)}{q_m(y, z)}
= E_{q_m}[\log P(y, z, G_m \mid \Theta)] - E_{q_m}[\log q_m(y, z)], \qquad (4.3)$$

with equality iff $q_m(y, z) = P(y, z \mid G_m, \Theta)$, where $E_q[\cdot]$ denotes the expected value w.r.t. the
distribution $q$. The likelihood $P(G_m \mid \Theta)$ might be difficult to compute, so instead of directly
attacking $\log P(G_m \mid \Theta)$, we maximize its lower bound:

$$\hat{\Theta} = \arg\max_{\Theta, \{q_m\}} \sum_{m} \left( E_{q_m}[\log P(y, z, G_m \mid \Theta)] - E_{q_m}[\log q_m] \right), \qquad (4.4)$$

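where we look for the variational distributions $q_m$ in the parametric form:

$$q(y_m, z_m \mid \gamma_m, \phi_m) = q(y_m \mid \gamma_m) \prod_{n=1}^{N_m} q(z_{mn} \mid \phi_{mn}). \qquad (4.5)$$

Here $\gamma_m \in \mathbb{S}^T$ and $\phi_{mn} \in \mathbb{S}^K$ are the variational parameters, and $q(y_m \mid \gamma_m) = \mathcal{M}(\gamma_m)$, $q(z_{mn} \mid \phi_{mn}) =
\mathcal{M}(\phi_{mn})$ are multinomial distributions. Combining Eq. (4.1), (4.4) and (4.5), the variational
learning problem becomes

$$\hat{\Theta} = \arg\max_{\{\gamma_m\}, \{\phi_m\}, \Theta} \sum_{m=1}^{M} L_m(\gamma_m, \phi_m, \Theta), \qquad (4.6)$$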

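where $L_m$ is

$$\begin{aligned}
L_m(\gamma_m, \phi_m, p, \alpha, \beta) &= E_q[\log P(y_m, z_m, G_m \mid p, \alpha, \beta)] - E_q[\log q(y_m, z_m)] \qquad (4.7) \\
&= E_q[\log P(y_m \mid p)] + \sum_{n=1}^{N_m} E_q[\log P(z_{mn} \mid y_m, \alpha)] \\
&\quad + \sum_{n=1}^{N_m} E_q[\log P(x_{mn} \mid z_{mn}, \beta)] - E_q[\log q(y_m \mid \gamma_m)] \\
&\quad - \sum_{n=1}^{N_m} E_q[\log q(z_{mn} \mid \phi_{mn})].
\end{aligned}$$

We omit the derivation and state the resulting updates; each of the solutions below maximizes $L_m$ when the
other variables are fixed:

$$\phi_{mnk} \propto \exp\left( \sum_{t=1}^{T} \gamma_{mt} \log \alpha_{tk} + \log P(x_{mn} \mid \beta_k) \right), \qquad (4.8)$$

$$\gamma_{mt} \propto \exp\left( \log p_t + \sum_{n=1}^{N_m} \sum_{k=1}^{K} \phi_{mnk} \log \alpha_{tk} \right), \qquad (4.9)$$

$$\alpha_{tk} \propto \sum_{m=1}^{M} \gamma_{mt} \sum_{n=1}^{N_m} \phi_{mnk}. \qquad (4.10)$$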

Note that the multinomial parameters need to be normalized to sum to one. Finally, to calculate
$\{\beta_k\}$, we need to solve

$$\{\beta_k\} = \arg\max_{\{\beta_k\}} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \sum_{k=1}^{K} \phi_{mnk} \log P(x_{mn} \mid \beta_k). \qquad (4.11)$$

In particular, when $P(x_{mn} \mid \beta_k) = \mathcal{N}(x_{mn} \mid \mu_k, \Sigma_k)$, learning $\{\mu_k, \Sigma_k\}$ is the same as fitting
Gaussians in a mixture of Gaussians model with $\phi$ being the weights of the samples [114].
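
For a single Gaussian topic, the weighted fit mentioned above reduces to a weighted mean and a weighted covariance. The following sketch shows the diagonal-covariance case only; the function name `fit_weighted_gaussian` and the variance floor `min_var` are assumptions added for this illustration.

```python
def fit_weighted_gaussian(points, weights, min_var=1e-6):
    """Weighted maximum-likelihood fit of a diagonal-covariance Gaussian.

    points  : list of D-dimensional points (all x_mn, pooled over groups)
    weights : matching list of responsibilities phi_mnk for one topic k
    min_var : floor on the variance to avoid degenerate topics (an assumption)
    """
    total = sum(weights)
    dim = len(points[0])
    mean = [sum(w * x[d] for x, w in zip(points, weights)) / total
            for d in range(dim)]
    var = [max(sum(w * (x[d] - mean[d]) ** 2 for x, w in zip(points, weights)) / total,
               min_var)
           for d in range(dim)]
    return mean, var
```

Points with zero responsibility for the topic have no influence on its estimated mean or variance.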

In inference, we seek the variational posterior distributions $q(y_m \mid \gamma_m)$ and $q(z_{mn} \mid \phi_{mn})$. We can fix the
values of the parameters $p, \alpha, \beta$ and update the values of $\gamma, \phi$ using Eq. (4.9) and (4.8) iteratively
until convergence. When learning the MGM model, we iteratively update all the model parameters
together with the variational parameters until convergence.
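
The inference loop for a single group can be sketched as a coordinate ascent that alternates the updates (4.8) and (4.9). This is a minimal sketch under stated assumptions: the per-point topic log-densities are precomputed and passed in as `log_px`, the small constant `eps` guarding against log(0) is an implementation choice, and the names `normalize_log` and `mgm_infer_group` are hypothetical.

```python
import math

def normalize_log(log_w):
    """Turn unnormalized log-weights into a probability vector."""
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    s = sum(w)
    return [v / s for v in w]

def mgm_infer_group(log_px, genre_weights, genres, n_iter=50, tol=1e-8):
    """Variational inference for one group with fixed model parameters.

    log_px[n][k] holds log P(x_mn | beta_k).
    Returns gamma_m (length T) and phi_m (N_m rows of length K).
    """
    n_genres, n_topics = len(genres), len(genres[0])
    eps = 1e-300                                   # guards log(0)
    log_alpha = [[math.log(max(a, eps)) for a in row] for row in genres]
    log_p = [math.log(max(p, eps)) for p in genre_weights]
    gamma = [1.0 / n_genres] * n_genres
    phi = []
    for _ in range(n_iter):
        # Eq. (4.8): phi_mnk ∝ exp(sum_t gamma_mt log alpha_tk + log P(x_mn | beta_k))
        prior = [sum(gamma[t] * log_alpha[t][k] for t in range(n_genres))
                 for k in range(n_topics)]
        phi = [normalize_log([prior[k] + row[k] for k in range(n_topics)])
               for row in log_px]
        # Eq. (4.9): gamma_mt ∝ exp(log p_t + sum_n sum_k phi_mnk log alpha_tk)
        counts = [sum(r[k] for r in phi) for k in range(n_topics)]
        new_gamma = normalize_log(
            [log_p[t] + sum(counts[k] * log_alpha[t][k] for k in range(n_topics))
             for t in range(n_genres)])
        delta = max(abs(a - b) for a, b in zip(gamma, new_gamma))
        gamma = new_gamma
        if delta < tol:
            break
    return gamma, phi
```

In a full learning procedure these two updates would be interleaved with the parameter updates (4.10) and (4.11) across all groups.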

In order to use the MGM model we need to determine $T$, the number of genres, and $K$, the number
of topics. To automatically determine their values, we can either use model scoring methods
such as BIC [149] or AIC [3], or we can resort to cross-validation to find the parameter
values that maximize the specific learning performance. The BIC score is given by
$\mathrm{BIC}(X, \Theta) = \ln L(X, \Theta) - \frac{|\Theta|}{2} \ln |X|$, where $|\Theta|$ stands for the number of free parameters
and $|X|$ for the number of observations. Similarly, the AIC score is given by
$\mathrm{AIC}(X, \Theta) = \ln L(X, \Theta) - |\Theta|$. We can then use these criteria
to search for the best $T$ and $K$ values. In practice, we first determine $K$ using $T = 1$, and then
determine $T$ with $K$ fixed.
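
As a small illustration of this model selection step, the two criteria can be computed as below. The sketch uses the standard forms of BIC and AIC in the "larger is better" convention, and the parameter count in `mgm_num_params` assumes full-covariance Gaussian topics; both function names and the counting formula are assumptions of this example, derived from the model definition rather than quoted from the text.

```python
import math

def aic_score(log_lik, n_params):
    """AIC in the convention ln L - |Theta| (larger is better)."""
    return log_lik - n_params

def bic_score(log_lik, n_params, n_obs):
    """BIC in the convention ln L - (|Theta| / 2) ln(number of observations)."""
    return log_lik - 0.5 * n_params * math.log(n_obs)

def mgm_num_params(n_genres, n_topics, dim):
    """Free parameters of MGM with full-covariance Gaussian topics.

    (T - 1) for the genre distribution, T * (K - 1) for the genres,
    and K * (D + D * (D + 1) / 2) for the topic means and covariances.
    """
    return (n_genres - 1) + n_genres * (n_topics - 1) \
        + n_topics * (dim + dim * (dim + 1) // 2)
```

One would evaluate the chosen score for each candidate $K$ with $T = 1$, keep the best $K$, and then repeat the search over $T$.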

4.4.2  Scoring  Functions 

In this section we discuss how to define scoring functions that can detect group anomalies based
on MGM. Having learned the parameters $\Theta$, a natural choice is to score a group by its negative
log-likelihood $-\ln P(G \mid \Theta)$. In theory, this likelihood score is able to find anomalous groups that
either contain anomalous points or have strange topic weights. However, directly using (4.2) may
produce dubious results. In fact, the likelihood is problematic even when used to find point-based anomalies.
First, if a group only contains points from the centers of the topics, then it would receive a low
score, even if such behavior never appeared in the training data. Second, if there is a single point
not belonging to any of the topics, then the score of the whole group will be inflated to infinity. It
is debatable whether this behavior is correct, but we argue that such anomalies can easily be found by
other, much simpler methods, and that they will overshadow the truly anomalous collective behaviors.

To find the distribution-based anomalies, we propose to score only the topic weights in each
group: we first infer the posterior distributions of the topics given the data, and then compute the
expected likelihood of the topic weights. Unlike LDA, MGM does not give each group a topic
weight variable $\theta_m$, so we use the collection of topic variables $z_m = \{z_{m,1}, \ldots, z_{m,N_m}\}$ instead.
Formally, for the MGM model the distribution-based score $X_d(G_m)$ is defined as

$$X_d(G_m) = E_{z_m}[-\log P(z_m \mid \Theta)] = -\sum_{z_m} P(z_m \mid \Theta, G_m) \log P(z_m \mid \Theta), \qquad (4.12)$$

$$P(z_m \mid \Theta) = \sum_{t} p_t\, \mathcal{M}(h_m; \alpha_t),$$

where $h_m$ is obtained by aggregating the values in $z_m$ into a histogram. This score finds groups
whose topic variables $z_m$ are not compatible with any of the genres (stereotypical topic weights)
in $\alpha$ learned by MGM.

To simplify the computation, we use the variational distributions $q_m(z_m \mid \phi_m)$ to approximate
the corresponding posterior distributions $P(z_m \mid \Theta, G_m)$ in (4.12). The integration can then be
done by the Monte Carlo method using samples drawn from the approximate posteriors. Similarly, the
point-based score $X_p$ can also be approximated by the variational lower bound.
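
A Monte Carlo estimate of the distribution-based score can be sketched as follows. Topic indicators are drawn from the variational posteriors $\phi_{mn}$, aggregated into a histogram, and scored against the genres. This sketch evaluates the probability of the sampled sequence, i.e. it omits the multinomial coefficient; whether the coefficient is included in the original definition cannot be recovered from the text, so this is an assumption, as are the function names and the default number of samples.

```python
import math
import random

def neg_log_p_z(hist, genre_weights, genres):
    """-log sum_t p_t prod_k alpha_tk^{h_k}, evaluated in log space."""
    terms = []
    for p_t, alpha_t in zip(genre_weights, genres):
        if p_t <= 0.0:
            continue
        log_term = math.log(p_t)
        for h, a in zip(hist, alpha_t):
            if h > 0:
                log_term += h * math.log(a) if a > 0.0 else float("-inf")
        terms.append(log_term)
    top = max(terms)
    if top == float("-inf"):
        return float("inf")               # histogram impossible under every genre
    return -(top + math.log(sum(math.exp(v - top) for v in terms)))

def distribution_score(phi, genre_weights, genres, n_samples=200, rng=None):
    """Monte Carlo estimate of E_q[-log P(z_m | Theta)] using q(z_mn) = M(phi_mn)."""
    rng = rng or random.Random(0)
    n_topics = len(genres[0])
    total = 0.0
    for _ in range(n_samples):
        hist = [0] * n_topics
        for row in phi:
            hist[rng.choices(range(n_topics), weights=row)[0]] += 1
        total += neg_log_p_z(hist, genre_weights, genres)
    return total / n_samples
```

Groups whose inferred topic proportions resemble none of the learned genres receive a high score and are ranked as more anomalous.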

4.5  Flexible  Genre  Models 

In the previous section, MGM enhances the LDA model by allowing groups to select their topic
weights from the dictionary of genres. With a sufficient number of genres, complex normal behaviors
of the topic weights can be captured, and topic weights that deviate from the learned genres will
be marked as anomalies.

Although MGM addresses some flexibility issues of LDA, it is still inadequate for group
anomaly detection. By modeling the genres as a dictionary of multinomial distributions, MGM
does not take the uncertainty of topic weights into account. MGM also inherits the assumption that
the topics are shared globally, and therefore it cannot capture point-level behaviors. In this section
we further improve our model to form a comprehensive model for detecting both distribution-based
and point-based anomalies.

At the group level, “genres” are still used to model the topic distributions. Instead of
multinomials, we use one Dirichlet distribution for each genre to model a typical distribution of topic
weights. At the point level, each group has its own topics to accommodate the variations of its
points, and these topics are generated by the global topic generators. We call this model the
Flexible Genre Model (FGM). Given a group of points, we can examine whether or not it conforms
to the normal behavior defined by the learned genres and topics. A point-based anomaly contains
points from unusual topics that are unlikely given the normal topic generators, while a
distribution-based anomaly has an unusual topic weight $\theta_m$ given the normal genres.

The generative process of FGM is presented in Algorithm 5, and a graphical representation of FGM is given in Figure 4.3. We let $\mathcal{M}(p)$ be the distribution of genres. Each genre is a Dirichlet distribution $P(\theta \mid \alpha_t)$ for generating the topic weights $\theta_m$, and $\alpha = \{\alpha_t\}_{t=1,\dots,T}$ is the set of genre parameters. Each group has $K$ topics $\beta_m = \{\beta_{m,k}\}_{k=1,\dots,K}$. The topic generators, $\{P(\beta_k \mid \eta_k)\}_{k=1,\dots,K}$, are the global distributions for generating the topics for each group. Having the topic distribution $\theta_m$ and the topics $\{\beta_{m,k}\}$, points are generated as in LDA.

Algorithm 5 The generative process of FGM.

for groups $m = 1$ to $M$ do
  • Choose a genre $y_m \in \{1, \dots, T\}$, $y_m \sim \mathcal{M}(p)$.
  • Choose a topic weight from the genre $y_m$: $\theta_m \in \mathbb{S}^K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$.
  • Choose $K$ topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,\dots,K}$.
  for points $n = 1$ to $N_m$ do
    • Choose a topic $z_{mn} \in \{1, \dots, K\}$, $z_{mn} \sim \mathcal{M}(\theta_m)$.
    • Generate a vector $x_{mn} \sim P(x_{mn} \mid \beta_{m,z_{mn}})$.

Figure 4.3: The Flexible Genre Model (FGM).
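The generative process in Algorithm 5 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names (`sample_giw`, `sample_fgm_group`), the parameter values, and the restriction of the GIW degrees of freedom to an integer are all choices made for this sketch and are not taken from the thesis.

```python
import numpy as np

def sample_giw(rng, mu0, kappa0, psi0, nu0):
    """Draw (mu, Sigma) from a Gaussian-Inverse-Wishart distribution.

    Sketch-only simplification: nu0 must be an integer >= D so that the
    Wishart draw can be built from nu0 Gaussian vectors.
    """
    d = len(mu0)
    # Sigma ~ IW(psi0, nu0)  <=>  inv(Sigma) ~ Wishart(inv(psi0), nu0)
    g = rng.multivariate_normal(np.zeros(d), np.linalg.inv(psi0), size=nu0)
    sigma = np.linalg.inv(g.T @ g)
    sigma = (sigma + sigma.T) / 2.0           # remove numerical asymmetry
    mu = rng.multivariate_normal(mu0, sigma / kappa0)
    return mu, sigma

def sample_fgm_group(rng, p, alpha, eta, n_points):
    """Generate one group following the steps of Algorithm 5."""
    y = rng.choice(len(p), p=p)                           # genre y_m ~ M(p)
    theta = rng.dirichlet(alpha[y])                       # theta_m ~ Dir(alpha_{y_m})
    topics = [sample_giw(rng, *eta_k) for eta_k in eta]   # group-specific topics beta_{m,k}
    z = rng.choice(len(eta), size=n_points, p=theta)      # z_mn ~ M(theta_m)
    x = np.array([rng.multivariate_normal(*topics[k]) for k in z])
    return y, theta, topics, z, x

# Hypothetical parameters: T = 2 genres, K = 3 topics, D = 2 dimensions.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.5])
alpha = np.array([[5.0, 5.0, 5.0], [20.0, 2.0, 2.0]])
centers = [(-1.7, -1.0), (1.7, -1.0), (0.0, 2.0)]
eta = [(np.array(c), 10.0, 0.5 * np.eye(2), 5) for c in centers]
y, theta, topics, z, x = sample_fgm_group(rng, p, alpha, eta, n_points=100)
```

Because the topics are redrawn for every group, two groups from the same genre share topic weights statistics but not the exact Gaussian components, which is the "adaptive topics" behavior described above.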

By comparing FGM to LDA, we can observe that: (i) in FGM, each group has a latent genre $y_m$, which determines what its topic weight should look like ($\mathrm{Dir}(\alpha_{y_m})$), and (ii) each group has its own topics $\{\beta_{m,k}\}_{k=1,\dots,K}$, but they are still tied through the generators $P(\beta \mid \eta)$. Thus, the topics can adapt to local group data, while information is still shared globally to improve the estimates. Moreover, the topic generators $P(\beta \mid \eta)$ determine what the topics $\{\beta_{m,k}\}$ should look like. If a group uses unusual topics to generate its points, it can be identified.

For computational convenience, the topic generators are chosen to be Gaussian-Inverse-Wishart (GIW) distributions parameterized by $\eta_k = \{\mu_{0k}, \kappa_{0k}, \Psi_{0k}, \nu_{0k}\}$ [57]. The GIW distribution is conjugate to the Gaussian topics. Let $\Theta = \{p, \alpha, \eta\}$ denote the model parameters. The complete likelihood of the data and latent variables in group $G_m$ under FGM is:

$$P(G_m, y_m, \theta_m, z_m, \beta_m \mid \Theta) = \mathcal{M}(y_m \mid p)\, \mathrm{Dir}(\theta_m \mid \alpha_{y_m}) \prod_k \mathcal{GIW}(\beta_{m,k} \mid \eta_k) \prod_n \mathcal{M}(z_{mn} \mid \theta_m)\, \mathcal{N}(x_{mn} \mid \beta_{m,z_{mn}}). \tag{4.13}$$

By integrating out $\theta_m$, $\beta_m$ and summing out $y_m$, $z_m$, we get the marginal likelihood of $G_m$:

$$P(G_m \mid \Theta) = \sum_t p_t \int \mathrm{Dir}(\theta_m \mid \alpha_t) \prod_k \mathcal{GIW}(\beta_{m,k} \mid \eta_k) \prod_n \sum_k \theta_{mk}\, \mathcal{N}(x_{mn} \mid \beta_{m,k})\, d\beta_m\, d\theta_m. \tag{4.14}$$

4.5.1 Inference and Learning

The parameters of FGM can be learned via the maximum-likelihood method. The inferred values of the latent variables $\theta_m$, $\beta_m$ can be used for detecting anomalies and exploring the data. However, exact inference and learning under FGM are intractable, so we develop the approximate methods described below.

Inference. Similar to the MGM model, approximate inference in FGM can be done via variational EM. However, due to the use of the GIW distributions, the formulae become very complicated. Alternatively, we can use Gibbs sampling [58] to learn FGM. In Gibbs sampling, we iteratively update one variable at a time by drawing a sample from its conditional distribution while all the other variables are held fixed. Thanks to the use of conjugate distributions, Gibbs sampling in FGM is simple and easy to implement. The sampling distributions of the latent variables in group $m$ are given below. We use $P(\cdot \mid \cdots)$ to denote the distribution of one variable conditioned on all the others.

For the genre membership $y_m$ we have that:

$$P(y_m = t \mid \cdots) \propto P(\theta_m \mid \alpha_t)\, P(y_m = t \mid p) = p_t\, \mathrm{Dir}(\theta_m \mid \alpha_t). \tag{4.15}$$

For the topic distribution $\theta_m$:

$$P(\theta_m \mid \cdots) \propto P(z_m \mid \theta_m)\, P(\theta_m \mid \alpha, y_m) = \mathcal{M}(z_m \mid \theta_m)\, \mathrm{Dir}(\theta_m \mid \alpha_{y_m}) \propto \mathrm{Dir}(\theta_m \mid \alpha_{y_m} + h_m), \tag{4.16}$$

where $h_m$ denotes the histogram of the $K$ values in the vector $z_m$. The last step follows from the Dirichlet-Multinomial conjugacy.

For the $k$th topic in group $m$, one can find that:

$$P(\beta_{m,k} \mid \cdots) \propto P(G_{mk} \mid \beta_{m,k})\, P(\beta_{m,k} \mid \eta_k) = \mathcal{N}(G_{mk} \mid \beta_{m,k})\, \mathcal{GIW}(\beta_{m,k} \mid \eta_k) \propto \mathcal{GIW}(\beta_{m,k} \mid \eta'_k), \tag{4.17}$$

where $G_{mk}$ are the points in group $G_m$ from the $k$th topic according to $z_m$. The last step follows from the Gaussian-Inverse-Wishart-Gaussian conjugacy. $\eta'_k$ is the parameter of the posterior GIW distribution given the prior parameter $\eta_k$ and $G_{mk}$; its form can be found in standard statistics textbooks, e.g. [57].

The topic membership $z_{mn}$ of point $n$ in group $m$ is sampled as follows:

$$P(z_{mn} = k \mid \cdots) \propto P(x_{mn} \mid z_{mn} = k, \beta_m)\, P(z_{mn} = k \mid \theta_m) = \theta_{mk}\, \mathcal{N}(x_{mn} \mid \beta_{m,k}). \tag{4.18}$$

Note that the multinomial parameters should be normalized to sum to 1.
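The conditionals for $y_m$, $\theta_m$ and $z_{mn}$ translate fairly directly into code. The sketch below performs one Gibbs pass over these three variables for a single group while holding the topics fixed; the GIW update of the topics is omitted for brevity. All function names and the toy parameters are assumptions made for illustration, not the implementation used in the thesis.

```python
import math
import numpy as np

def log_dirichlet(theta, alpha):
    """Log density of Dir(alpha) evaluated at theta."""
    norm = math.lgamma(alpha.sum()) - sum(math.lgamma(a) for a in alpha)
    return norm + float(np.sum((alpha - 1.0) * np.log(theta)))

def log_gauss(x, mu, sigma):
    """Log density of N(mu, sigma) at each row of x."""
    d = x.shape[1]
    diff = x - mu
    maha = np.sum(diff @ np.linalg.inv(sigma) * diff, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (maha + logdet + d * math.log(2 * math.pi))

def sample_from_logits(rng, logits):
    """Sample an index with probability proportional to exp(logits)."""
    w = np.exp(logits - logits.max())
    return rng.choice(len(w), p=w / w.sum())   # normalize to sum to 1

def gibbs_sweep(rng, x, theta, topics, p, alpha):
    """One pass over z_m, y_m and theta_m for one group (topics held fixed)."""
    k = len(topics)
    # z_mn = k with probability proportional to theta_mk * N(x_mn | beta_mk)
    ll = np.stack([log_gauss(x, mu, sig) for mu, sig in topics], axis=1)
    z = np.array([sample_from_logits(rng, np.log(theta) + row) for row in ll])
    # y_m = t with probability proportional to p_t * Dir(theta_m | alpha_t)
    y = sample_from_logits(rng, np.array(
        [math.log(p[t]) + log_dirichlet(theta, alpha[t]) for t in range(len(p))]))
    # theta_m ~ Dir(alpha_{y_m} + h_m), h_m being the histogram of z_m
    h = np.bincount(z, minlength=k)
    theta = rng.dirichlet(alpha[y] + h)
    return y, theta, z

# Toy group: 90 points near topic 0 and 10 points near topic 1.
rng = np.random.default_rng(1)
topics = [(np.array([-2.0, 0.0]), np.eye(2)), (np.array([2.0, 0.0]), np.eye(2))]
x = np.vstack([rng.normal([-2.0, 0.0], 1.0, size=(90, 2)),
               rng.normal([2.0, 0.0], 1.0, size=(10, 2))])
p = np.array([0.5, 0.5])
alpha = np.array([[9.0, 1.0], [1.0, 9.0]])
y, theta, z = gibbs_sweep(rng, x, np.array([0.5, 0.5]), topics, p, alpha)
```

Working in log space and subtracting the maximum before exponentiating avoids underflow when the Gaussian likelihoods are very small.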

Learning. Learning the parameters of FGM helps us identify the normal behaviors of the groups and points. Each of the genres $\alpha = \{\alpha_t\}_{t=1,\dots,T}$ captures one typical distribution of topic weights as $\theta \sim \mathrm{Dir}(\alpha_t)$. The topic generators $\eta = \{\eta_k\}_{k=1,\dots,K}$ determine what the normal topics $\{\beta_{m,k}\}$ should look like. We use single-sample Monte Carlo EM [24] to learn the parameters from the samples provided by the Gibbs sampler. Given the sampled latent variables, we update the parameters to their maximum likelihood estimates: we learn $\alpha$ from $y$ and $\theta$; $\eta$ from $\beta$; and $p$ from $y$.

$p$ can easily be estimated from the histogram of the $y$’s. $\alpha_t$ is learned by the MLE of a Dirichlet distribution given the topic weights of the groups in genre $t$, i.e. $\{\theta_m \mid y_m = t,\ m = 1, \dots, M\}$. The MLE of a Dirichlet distribution can be computed using the Newton-Raphson method [118].

The $k$th topic generator’s parameter $\eta_k = \{\mu_{0k}, \kappa_{0k}, \Psi_{0k}, \nu_{0k}\}$ is the MLE of a GIW distribution given the parameters $\{\beta_{m,k} = (\mu_{m,k}, \Sigma_{m,k})\}_{m=1,\dots,M}$ (the $k$th topics of all groups). We have derived an efficient solution for this MLE problem. The details can be found at the end of this chapter.

The overall learning algorithm works by repeating the following procedure until convergence or equilibrium: (1) run Gibbs sampling to infer the states of the latent variables; (2) update the model parameters using the estimators above. If we only want to infer the posterior distributions of the latent variables, we can repeat only step (1) until enough samples are gathered to form the approximate empirical distribution.

As for MGM, to select appropriate values for the parameters $T$ and $K$ (the numbers of genres and topics), we can apply the Bayesian information criterion (BIC) [149], or use cross-validation to find the values that maximize the learning performance.


4.5.2  Scoring  Functions 

FGM can easily be used for group anomaly detection. We first infer a group’s latent states, including the topics $\beta$ and the topic weight $\theta$, and then examine whether they are compatible with the topic generators and genres in the model.

Point-based anomalies can be found by examining the topics. If a group contains anomalous points, then the topics that generated these points will deviate from the topic generators $\eta$. Let $P(\beta_m \mid \Theta) = \prod_{k=1}^{K} \mathcal{GIW}(\beta_{m,k} \mid \eta_k)$. We define the point-based anomaly score as

$$f_p(G_m) = E_{\beta_m}\left[-\log P(\beta_m \mid \Theta)\right] = -\int P(\beta_m \mid \Theta, G_m) \log P(\beta_m \mid \Theta)\, d\beta_m. \tag{4.19}$$

The posterior distribution $P(\beta_m \mid \Theta, G_m)$ can again be approximated by the samples from Gibbs sampling, and the expectation can be computed by Monte Carlo integration. Compared to finding point-based anomalies using the point-wise likelihood as in Section 4.4.2, this scoring function examines the topics instead of the points. The topics serve as a summary of the points as a whole, so that the problems raised in Section 4.4.2 can be solved.

Distribution-based anomalies can be detected by examining the topic weights as in MGM. The genres $\{\alpha_t\}_{t=1,\dots,T}$ capture the typical distributions of topic weights. If a group’s topic weight $\theta_m$ is unlikely under these genres, we call it anomalous. Let $P(\theta_m \mid \Theta) = \sum_{t=1}^{T} p_t\, \mathrm{Dir}(\theta_m \mid \alpha_t)$. The distribution-based anomaly score is

$$f_d(G_m) = E_{\theta_m}\left[-\log P(\theta_m \mid \Theta)\right] = -\int P(\theta_m \mid \Theta, G_m) \log P(\theta_m \mid \Theta)\, d\theta_m. \tag{4.20}$$

Again, this expectation can be approximated using Gibbs sampling and Monte Carlo integration.
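As an illustration of the Monte Carlo step, the following sketch estimates the distribution-based score (4.20) by averaging $-\log \sum_t p_t\, \mathrm{Dir}(\theta \mid \alpha_t)$ over posterior samples of $\theta_m$. The samples here are simulated stand-ins for Gibbs output, and the function names and genre parameters are assumptions made for this example.

```python
import math
import numpy as np

def log_dirichlet(theta, alpha):
    """Log density of Dir(alpha) evaluated at theta."""
    norm = math.lgamma(alpha.sum()) - sum(math.lgamma(a) for a in alpha)
    return norm + float(np.sum((alpha - 1.0) * np.log(theta)))

def distribution_score(theta_samples, p, alpha):
    """Monte Carlo estimate of E[-log sum_t p_t Dir(theta | alpha_t)]."""
    scores = []
    for theta in theta_samples:
        terms = np.array([math.log(p[t]) + log_dirichlet(theta, alpha[t])
                          for t in range(len(p))])
        m = terms.max()                      # log-sum-exp for numerical stability
        scores.append(-(m + math.log(np.sum(np.exp(terms - m)))))
    return float(np.mean(scores))

# Two hypothetical genres, peaked near [1/3, 1/3, 1/3] and [0.84, 0.08, 0.08].
p = np.array([0.5, 0.5])
alpha = np.array([[33.0, 33.0, 33.0], [84.0, 8.0, 8.0]])

# Simulated posterior samples of theta_m (stand-ins for Gibbs samples).
rng = np.random.default_rng(0)
normal_samples = rng.dirichlet([84.0, 8.0, 8.0], size=50)
odd_samples = rng.dirichlet([8.0, 84.0, 8.0], size=50)

score_normal = distribution_score(normal_samples, p, alpha)
score_odd = distribution_score(odd_samples, p, alpha)
```

A group whose topic weights concentrate on the second topic matches neither genre, so its score is larger than that of a group whose weights resemble a learned genre.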
4.6 Nonparametric Genre Models


FGM provides us with great flexibility to model the group-level and point-level behaviors of the groups. It also inspires us to design effective scoring functions to find different types of group anomalies. However, such flexibility comes at a price. First, inference and learning in FGM are slower than in MGM. Specifically, the use of the multivariate Gaussian-Inverse-Wishart distribution greatly increases the computation needed for both inference and learning. Second, the conjugate priors are chosen merely for computational convenience rather than correctness. Third, the conjugate priors involve a large number of free parameters that need to be set or learned, causing volatile performance and difficulties in practical use. Therefore, we want other prior distributions that offer flexibility similar to FGM but are simpler and faster.

The nonparametric empirical Bayes (NPEB) method can be used to solve this problem. Unlike methods that use parametric conjugate priors, NPEB does not assume the form of the prior distribution, which keeps prior knowledge and restrictions to a minimum. Applying NPEB to FGM, we can replace the mixture of Dirichlet distributions for the topic weights $P(\theta \mid p, \alpha)$ and the Gaussian-Inverse-Wishart distributions for the topics $\{P(\beta_k \mid \eta_k)\}_k$ with the simpler $P(\theta \mid F_\theta)$ and $P(\beta \mid F_\beta)$ respectively, where $F_\theta$, $F_\beta$ are nonparametric distributions for $\theta$ and $\beta$ without further assumptions.

We use the nonparametric maximum likelihood (NPML) technique proposed by [89, 90]. It can be proved that the maximum likelihood estimate (MLE) of $F$ is a step function in the parameter space, i.e. the probability mass of $F$ exists only at a finite number of discrete points in the parameter space, and the number of steps in $F$ grows as the data become more complex.

The above result means that, for a given data set, the MLE $F_\theta$ contains a finite number of values for $\theta$. To simplify the computation, we fix the number of steps in $F_\theta$ to a relatively large value beforehand, instead of computing it from the data. Similar modeling can also be applied to the topic generators. In this case, NPML becomes very similar to the mechanism we used in MGM.

We use the simplified NPML method above to improve FGM and obtain the nonparametric genre model (NGM). Suppose that $F_\theta$, the prior of the topic weights, has $T$ elements, and that $F_{\beta_k}$, the prior of the $k$th topic, has $S$ elements. The generative process of NGM is described in Algorithm 6, and its graphical representation is shown in Figure 4.4.


Algorithm 6 The generative process of NGM.

for groups $m = 1$ to $M$ do
  • Choose a genre $y_m \in \{1, \dots, T\}$, $y_m \sim \mathcal{M}(p)$. Let the topic weight $\theta_m = \alpha_{y_m} \in \mathbb{S}^K$.
  for topics $k = 1$ to $K$ do
    • Choose an $r_{mk} \in \{1, \dots, S\}$, $r_{mk} \sim \mathcal{M}(\pi_k)$, $\pi_k \in \mathbb{S}^S$. Let the $k$th topic be $\beta_k = \eta_{k,r_{mk}}$.
  for $n = 1$ to $N_m$ do
    • Choose a topic $z_{mn} \in \{1, \dots, K\}$, $z_{mn} \sim \mathcal{M}(\theta_m)$.
    • Generate a point $x_{mn} \in \mathbb{R}^D$, $x_{mn} \sim P(x_{mn} \mid \beta_{z_{mn}})$.

Figure 4.4: The nonparametric genre model (NGM).


The complete and marginal likelihoods of the data under NGM are:

$$p(y_m, r_m, z_m, G_m \mid p, \alpha, \pi, \eta) = p(y_m \mid p) \prod_k p(r_{mk} \mid \pi_k) \prod_n p(z_{mn} \mid y_m, \alpha)\, p(x_{mn} \mid z_{mn}, r_m, \eta) = p_{y_m} \prod_k \pi_{k,r_{mk}} \prod_n \alpha_{y_m,z_{mn}}\, p(x_{mn} \mid \eta_{z_{mn},r_{m,z_{mn}}}), \tag{4.21}$$

$$p(G_m \mid p, \alpha, \pi, \eta) = \sum_{y_m} p_{y_m} \sum_{r_m} \prod_k \pi_{k,r_{mk}} \prod_n \sum_{z_{mn}} \alpha_{y_m,z_{mn}}\, p(x_{mn} \mid \eta_{z_{mn},r_{m,z_{mn}}}). \tag{4.22}$$

NGM does not assume specific forms for the prior distributions of the topic weights $\theta$ and the topics $\beta$; therefore, with a suitable choice of $T$ and $S$, it can model the complex behaviors present in the data. The model involves only simple distributions such as Gaussians, and is therefore easy and fast to learn.
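For comparison with FGM, the NGM generative process of Algorithm 6 can be sketched as below. The only randomness at the group level is the choice among a finite set of candidate topic weights and candidate Gaussians. The function name, array layouts, and parameter values are assumptions made for this illustration.

```python
import numpy as np

def sample_ngm_group(rng, p, alpha, pi, eta, n_points):
    """Generate one group following the steps of Algorithm 6.

    alpha: (T, K) candidate topic weights, each row on the simplex.
    pi:    (K, S) mixing weights over the S candidate Gaussians of each topic.
    eta:   eta[k][s] = (mean, covariance) of candidate s for topic k.
    """
    y = rng.choice(len(p), p=p)                 # genre y_m ~ M(p)
    theta = alpha[y]                            # theta_m = alpha_{y_m}, no sampling
    n_topics, n_cand = pi.shape
    # r_mk ~ M(pi_k): pick one candidate Gaussian for every topic
    r = np.array([rng.choice(n_cand, p=pi[k]) for k in range(n_topics)])
    z = rng.choice(n_topics, size=n_points, p=theta)    # z_mn ~ M(theta_m)
    x = np.array([rng.multivariate_normal(*eta[k][r[k]]) for k in z])
    return y, r, z, x

# Hypothetical setting: T = 2, K = 2, S = 2, D = 2.
rng = np.random.default_rng(0)
p = np.array([0.7, 0.3])
alpha = np.array([[0.5, 0.5], [0.9, 0.1]])
pi = np.array([[0.5, 0.5], [0.8, 0.2]])
eye = 0.2 * np.eye(2)
eta = [[(np.array([-2.0, 0.0]), eye), (np.array([-2.0, 1.0]), eye)],
       [(np.array([2.0, 0.0]), eye), (np.array([2.0, 1.0]), eye)]]
y, r, z, x = sample_ngm_group(rng, p, alpha, pi, eta, n_points=50)
```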

4.6.1  Inference  and  Learning 

When NGM is learned from the data, the nonparametric priors embodied in the parameters $\{p, \alpha\}$ and $\{\pi_k, \eta_k\}_k$ will contain the typical topic weights as well as the typical topics in the training set. These priors can then help the inference of the topic weights $\theta$ and the topics $\{\beta_k\}_{k=1,\dots,K}$.

Like other genre models, NGM can be learned via the variational EM algorithm. We approximate the posterior distribution of the latent variables, $P(y_m, r_m, z_m \mid G_m, \Theta)$, by a factorized variational distribution

$$q(y_m, r_m, z_m \mid \gamma_m, \tau_m, \phi_m) = q(y_m \mid \gamma_m) \prod_k q(r_{mk} \mid \tau_{mk}) \prod_n q(z_{mn} \mid \phi_{mn}) = \mathcal{M}(y_m \mid \gamma_m) \prod_k \mathcal{M}(r_{mk} \mid \tau_{mk}) \prod_n \mathcal{M}(z_{mn} \mid \phi_{mn}), \tag{4.23}$$

where the variational distribution $q(y \mid \gamma)$ models the genre $y$, $q(r \mid \tau)$ models how the topic was sampled from the nonparametric topic generators, and $q(z \mid \phi)$ models which topic generated a point.

The model and variational parameters can be obtained by maximizing the variational lower bound on the marginal likelihood of the data, as described in Section 4.4.1. The actual derivation is similar to the one used in MGM. In the following we only show the update formulae for the iterative solution.

$$\phi_{mnk} \propto \exp\Big( \sum_t \gamma_{mt} \log \alpha_{tk} + \sum_s \tau_{mks} \log p(x_{mn} \mid \eta_{ks}) \Big) \tag{4.24}$$

$$\gamma_{mt} \propto \exp\Big( \log p_t + \sum_{n,k} \phi_{mnk} \log \alpha_{tk} \Big) \tag{4.25}$$

$$\tau_{mks} \propto \exp\Big( \log \pi_{ks} + \sum_n \phi_{mnk} \log p(x_{mn} \mid \eta_{ks}) \Big) \tag{4.26}$$

$$\alpha_{tk} \propto \sum_m \gamma_{mt} \sum_n \phi_{mnk} \tag{4.27}$$

$$p = \frac{1}{M} \sum_m \gamma_m \tag{4.28}$$

$$\pi_k = \frac{1}{M} \sum_m \tau_{mk} \tag{4.29}$$

To learn the topic generators, we need to maximize the following objective function:

$$\eta_{ks} = \arg\max_{\eta_{ks}} \sum_{m,n} \tau_{mks}\, \phi_{mnk} \log p(x_{mn} \mid \eta_{ks}). \tag{4.30}$$

Therefore, estimating $\eta_{ks}$ is the same as fitting a Gaussian distribution using samples weighted by $\tau_{mks}\phi_{mnk}$.
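The weighted Gaussian fit implied by (4.30) has a closed form: the weighted mean and the weighted covariance. A minimal sketch follows, assuming the points of all groups have been stacked into one array and that each point carries the weight $\tau_{mks}\phi_{mnk}$ for the candidate $(k, s)$ being updated; the function name is hypothetical.

```python
import numpy as np

def weighted_gaussian_fit(x, w):
    """Weighted maximum-likelihood estimate of a Gaussian.

    x: (N, D) array of points stacked over all groups.
    w: (N,) nonnegative weights, e.g. tau_mks * phi_mnk for one (k, s).
    """
    w = w / w.sum()                         # normalize the weights
    mu = w @ x                              # weighted mean
    diff = x - mu
    sigma = (w[:, None] * diff).T @ diff    # weighted covariance
    return mu, sigma

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))

# With uniform weights the result matches the ordinary MLE.
mu_u, sigma_u = weighted_gaussian_fit(x, np.ones(200))

# Points with zero weight are ignored entirely.
w = np.concatenate([np.ones(100), np.zeros(100)])
mu_h, sigma_h = weighted_gaussian_fit(x, w)
```

In practice a small ridge term is often added to the covariance when the effective number of weighted points is small; that regularization is not part of the update above.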


4.6.2  Scoring  Functions 

Finding anomalies based on the inferred latent variables is not straightforward under NGM. Due to the use of nonparametric priors, we do not have explicit latent variables such as the topic weight $\theta$ and the topics $\beta$ for each group. So instead of scoring the latent variables, we find ways to score the data directly.

Point-based Anomaly. As before, we can find point-based anomalies by looking for groups that use unusual topics to generate their points. Since there is no explicit latent variable for the topics, we use the following scoring function to score the points:

$$X_p(G_m) = -\sum_{z_m} p(z_m \mid G_m, \Theta) \log p(G_m \mid z_m, \Theta) \approx -\sum_{z_m} q(z_m \mid \phi_m) \log p(G_m \mid z_m, \pi, \eta) \tag{4.31}$$

$$= -\sum_{z_m} q(z_m \mid \phi_m) \sum_k \log p(G_{mk} \mid \pi_k, \eta_k) = -\sum_{z_m} q(z_m \mid \phi_m) \sum_k \log \sum_s \pi_{ks}\, p(G_{mk} \mid \eta_{ks}),$$

where $G_{mk}$ contains all the points from topic $k$ according to $z_m$. The above equations use the variational distributions to approximate the actual posterior distributions.

The key question in $X_p$ is how to compute $p(G_{mk} \mid \eta_{ks})$, i.e. how likely it is that the set of points $G_{mk}$ was generated by the Gaussian distribution $\mathcal{N}(\eta_{ks})$. The most intuitive option is the likelihood of i.i.d. points, $p(G \mid \eta) = \prod_n p(x_n \mid \eta)$, but it is flawed. For example, if the test group contains only points located at the centers of the Gaussians, then the likelihood will be high, which is a mistake because the distribution of the points is not normal. This problem can only be addressed by considering the points $G_{mk}$ as a whole. This is also why in Eq. (4.12) we have to score the topic variables $z_m$ as a whole instead of individually.

Essentially, we need $p(G_{mk} \mid \eta_{ks})$ to be a goodness-of-fit (GoF) measurement. Unfortunately, GoF tests in high dimensions are notoriously difficult. Here we take a parametric approach. First, we construct a prior distribution for $\eta$, denoted as $\Omega(\eta)$. Then, using $\Omega(\eta)$ as the Bayesian prior, we estimate a distribution for $G_{mk}$ that has the same parametric form as $\eta$, denoted as $f(G_{mk}, \Omega(\eta))$. Finally, we use $p(f(G_{mk}, \Omega(\eta)) \mid \Omega(\eta))$ as a surrogate for $p(G_{mk} \mid \eta)$. Intuitively, this approach uses the parametric distribution $f(G_{mk}, \Omega(\eta))$ to summarize $G_{mk}$, and then evaluates how probable it is that $f(G_{mk}, \Omega(\eta))$ was generated from the model. We call this approach the pseudo-prior method.

We choose $\Omega(\eta)$ to be the conjugate prior of $\eta$, and set the mode of $\Omega(\eta)$ to $\eta$ so that $\Omega(\eta)$ reflects what $\eta$ should be like. Since $\Omega(\eta)$ is conjugate to $\eta$, estimating $f(G_{mk}, \Omega(\eta))$ and evaluating its likelihood under $\Omega(\eta)$ is straightforward. Another advantage of this approach is that conjugate prior distributions usually have a degrees-of-freedom parameter that specifies how strong the prior is. With a suitable strength, the Bayesian estimate $f(G_{mk}, \Omega(\eta))$ can be robust against random individual points and focus more on the collective behavior. In addition, if two groups have the same proportion of anomalous points, then the larger group receives a higher score.

Concretely, for our NGM model, where each $\eta_{ks}$ is a Gaussian distribution, we let $\Omega(\eta_{ks}, \lambda)$ be the GIW distribution whose mode is at $\eta_{ks}$, where $\lambda$ is a parameter specifying its degrees of freedom. $\lambda$ acts as the “pseudo counts” in the prior distribution, and a larger $\lambda$ makes the score less sensitive to individual points or smaller groups. To evaluate $p(G_{mk} \mid \eta_{ks})$, we first estimate the Gaussian distribution $f(G_{mk}, \Omega(\eta_{ks}, \lambda))$ based on the data $G_{mk}$ and the prior $\Omega(\eta_{ks}, \lambda)$, and then calculate the GIW likelihood $p(f(G_{mk}, \Omega(\eta_{ks}, \lambda)) \mid \Omega(\eta_{ks}, \lambda))$.

Note the resemblance between the pseudo-prior scoring function and the point-based scoring function $f_p$ (4.19) for FGM. They are very similar in that they both use adaptive topics to summarize the points and then score the topics. FGM explicitly learns the adaptive topics during training. NGM, on the other hand, constructs the adaptive topics only at detection time using pseudo-priors, making the training much simpler while achieving similar results.

Distribution-based Anomaly. The above pseudo-prior approach can also be used to find distribution-based anomalies using the topic variables $z_m$. The pseudo-prior scoring function for finding distribution-based anomalies is

$$X_d(G_m) = -\sum_{z_m} p(z_m \mid G_m, \Theta) \log p(z_m \mid \Theta) \approx -\sum_{z_m} q(z_m) \log \sum_t p_t\, p(z_m \mid \alpha_t). \tag{4.32}$$

To evaluate $p(z_m \mid \alpha_t)$, we first construct the pseudo-prior $\Omega(\alpha_t, \lambda) = \mathrm{Dir}(\alpha_t, \lambda)$, where $\lambda$ is the pseudo-count in the Dirichlet, then estimate the posterior topic weights $\hat{\theta}_m = f(z_m, \Omega(\alpha_t, \lambda))$, and finally evaluate the Dirichlet likelihood $p(f(z_m, \Omega(\alpha_t, \lambda)) \mid \Omega(\alpha_t, \lambda)) = \mathrm{Dir}(\hat{\theta}_m \mid \alpha_t, \lambda)$.

Note that the scoring function based on the multinomial likelihood (4.12) can still work well in NGM. In fact, (4.12) is a simpler and more natural way to score the discrete variables $z_m$. The pseudo-prior approach was only proposed as a remedy for the difficulty of evaluating the goodness-of-fit for continuous multidimensional data.
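To make the pseudo-prior idea concrete for the Dirichlet case, here is one possible instantiation. The text does not spell out how $\mathrm{Dir}(\alpha_t, \lambda)$ is parameterized or which posterior estimate $f$ is used, so the following choices are assumptions made purely for illustration: the pseudo-prior is $\mathrm{Dir}(1 + \lambda\alpha_t)$, whose mode is $\alpha_t$ when $\alpha_t$ lies in the interior of the simplex; the estimate $\hat{\theta}_m$ is the posterior mode; and a single histogram of hard topic assignments stands in for the sum over $z_m$.

```python
import math
import numpy as np

def log_dirichlet(theta, a):
    """Log density of Dir(a) evaluated at theta."""
    norm = math.lgamma(a.sum()) - sum(math.lgamma(v) for v in a)
    return norm + float(np.sum((a - 1.0) * np.log(theta)))

def pseudo_prior_log_lik(h, alpha_t, lam):
    """Surrogate for log p(z_m | alpha_t) given the topic histogram h."""
    a = 1.0 + lam * alpha_t                              # assumed pseudo-prior, mode at alpha_t
    theta_hat = (lam * alpha_t + h) / (lam + h.sum())    # posterior mode under that prior
    return log_dirichlet(theta_hat, a)

def pseudo_prior_score(h, p, alpha, lam):
    """-log sum_t p_t * surrogate likelihood, computed with log-sum-exp."""
    terms = np.array([math.log(p[t]) + pseudo_prior_log_lik(h, alpha[t], lam)
                      for t in range(len(p))])
    m = terms.max()
    return -(m + math.log(np.sum(np.exp(terms - m))))

# Hypothetical genres (strictly positive rows on the simplex) and pseudo-count.
p = np.array([0.5, 0.5])
alpha = np.array([[1 / 3, 1 / 3, 1 / 3], [0.84, 0.08, 0.08]])
lam = 50.0

h_normal = np.array([84.0, 8.0, 8.0])    # topic counts matching the second genre
h_odd = np.array([8.0, 84.0, 8.0])       # topic counts matching no genre
score_normal = pseudo_prior_score(h_normal, p, alpha, lam)
score_odd = pseudo_prior_score(h_odd, p, alpha, lam)
```

Under these assumptions, a larger `lam` pulls `theta_hat` toward the genre and so dampens the score of small groups, which mirrors the role of the pseudo-counts described above.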


4.7  Discussion 

In the previous sections, we progressively proposed three models: the multinomial genre model (MGM), the flexible genre model (FGM), and the nonparametric genre model (NGM). MGM is a basic model that introduces the concept of genre and uses it to enhance LDA’s capability of modeling topic weights. FGM enhances MGM by using more flexible probabilistic components and allowing the groups to have different topics. Inspired by FGM, NGM uses nonparametric priors to further remove modeling assumptions and enable faster learning.

The computational cost of the genre models mainly comes from the inference procedures, where we have to compute the point likelihood given the topic or topic generator, i.e. $p(x \mid \beta)$ or $p(x \mid \eta)$. For the $D$-dimensional Gaussians used in this chapter, computing the likelihood of $N$ points w.r.t. all topics costs $O(NKD^2)$ time (or $O(NKSD^2)$ for NGM). To make it faster, we can first reduce the dimensionality $D$ of the points using a dimensionality reduction algorithm such as PCA. Alternatively, we can use Gaussians with diagonal covariances so that the time complexity is reduced to $O(NKD)$. Further, when there are many groups or the groups are very large, we can use a subset of the groups or points to get initial estimates of the models that can be refined later.

The parameters $T$ and $K$ are needed to specify the genre models. Usually these values can be selected by AIC/BIC scores or cross-validation. Theoretically, all the parameters of the prior distributions in FGM (specifically the Dirichlet parameters $\alpha$ and the GIW parameters $\eta$) can be learned via empirical Bayes methods. When these parameters are learned correctly, good performance can be achieved. However, in practice we found that such an approach is usually unstable and may lead to poor local optima. On the other hand, finding a good fixed value for them involves much tuning. NGM partly solves this problem by controlling the prior complexity via the number of steps in the prior distribution, which is more intuitive and easier to tune.

Multiple scoring functions were proposed for each model to find both point-based and distribution-based anomalies. Although it is tempting to look for a universal scoring function, such attempts are usually futile because the definition of an anomaly depends on the specific problem; for example, the relative importance of point-based and distribution-based anomalousness differs across problems and users. In practice, we suggest trying multiple scoring functions to find the anomalies that are particularly interesting.

Clearly, when $S = 1$ NGM is equivalent to MGM. To further see their relationship, note that the marginal likelihood of NGM (4.22) can also be written as

$$p(G_m \mid p, \alpha, \pi, \eta) = \sum_{r_m} \Big( \prod_k \pi_{k,r_{mk}} \Big) \sum_{y_m} p_{y_m} \prod_n \sum_{z_{mn}} \alpha_{y_m,z_{mn}}\, p(x_{mn} \mid \eta_{z_{mn},r_{m,z_{mn}}}). \tag{4.33}$$

Comparing (4.33) to the MGM marginal likelihood (4.2), we see that NGM can be considered as a mixture of $K \times S$ dependent MGM models (using only $K \times S$ independent Gaussian components), with structured mixing weights specified by $\pi$. We found that NGM behaves like an MGM model with $K \times S$ topics. Further, note that the pseudo-prior scores for NGM described in Section 4.6.2 can also be applied to MGM. Considering all their similarities, we conclude that when simplicity and efficiency are important, NGM can be replaced by an MGM with a large number of topics. Indeed, in practice we found that the differences between MGM and NGM are insignificant, which makes the simpler MGM a more cost-effective choice.

The genre models are able to learn effectively from groups with different numbers of points. In training, the smaller groups have less influence on the likelihood of the data. In prediction, the scoring functions assume that a group is normal unless some evidence of anomaly is observed, and the anomalousness increases as the group size becomes larger (between two anomalous groups with the same distribution of points, the larger group has a higher anomaly score). These behaviors help us ignore the noise and focus on the real anomalies.

In addition to detecting group anomalies, the genre models can also be used to accomplish other learning tasks. For example, as in [50], the genres can be used to cluster the groups. Using techniques similar to naive Bayes, we can use genre models to classify groups. In addition, the genres and topic generators provide a natural summary of the data that can help us explore the data sets.

Finally, the generative methods can also be used to learn from structured groups, where a point might depend on other points in the same group. The idea is that we first find, for each group, a generative model that generates the points in it, and then find a global mechanism that generates those generative models. In genre models, the generative models are mixtures of topics, while the global mechanism is realized by the genres and the topic generators. To handle structured groups, we can use generative models such as hidden Markov models (HMM) and Markov random fields (MRF), and then try to design suitable mechanisms to generate the HMMs and MRFs.


4.8  Experiments 

In  this  section  we  provide  empirical  results  produced  by  the  genre  models  on  both  synthetic  and  real 
data.  We  demonstrate  the  behaviors  of  different  models,  and  show  their  effectiveness  in  detecting 
various  group  anomalies. 

4.8.1  Synthetic  Data 

First, we demonstrate the behaviors of the genre models and the scoring functions on a synthetic data set, described below. We generated the data using 2-dimensional Gaussian mixture models (GMM). Each group has a GMM to generate its points. All GMMs share three Gaussian components with covariance $0.2 \times I_2$, centered at the points $(-1.7, -1)$, $(1.7, -1)$, and $(0, 2)$, respectively. A group’s mixing weights are randomly chosen from $w_1 = [0.33, 0.33, 0.33]$ or $w_2 = [0.84, 0.08, 0.08]$. Thus, a group is normal if its points are sampled from these three Gaussians and its mixing weights are close to either $w_1$ or $w_2$. To test the detectors, we injected both point-based and distribution-based anomalies. Point-based anomalies were groups of points sampled from $\mathcal{N}((0, 0), I)$. Distribution-based anomalies were generated by GMMs consisting of the normal Gaussian components but with mixing weights $[0.33, 0.64, 0.03]$ and $[0.08, 0.84, 0.08]$, which differ from $w_1$ and $w_2$. We generated $M = 100$ groups, each of which had $N_m \sim \mathrm{Poisson}(100)$ points. One point-based anomalous group and two distribution-based anomalous groups were injected into the data set.
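A data set of this kind can be generated with a short NumPy script. The sketch below follows the description above but has to make a few assumptions that the text leaves open: that the three injected anomalies count toward the $M = 100$ groups, that $w_1$ and $w_2$ are chosen with equal probability, and that the rounded weights $w_1$ are renormalized to sum to one. The function names are hypothetical.

```python
import numpy as np

CENTERS = np.array([[-1.7, -1.0], [1.7, -1.0], [0.0, 2.0]])
W1 = [0.33, 0.33, 0.33]
W2 = [0.84, 0.08, 0.08]
ANOMALOUS_WEIGHTS = [[0.33, 0.64, 0.03], [0.08, 0.84, 0.08]]

def sample_group(rng, weights, n):
    """Draw n points from the shared three-component GMM (covariance 0.2 * I)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                     # W1 is rounded and sums to 0.99
    z = rng.choice(len(CENTERS), size=n, p=w)
    return CENTERS[z] + rng.normal(scale=np.sqrt(0.2), size=(n, 2))

def make_dataset(rng, n_groups=100):
    groups, labels = [], []
    for _ in range(n_groups - 3):       # assumption: anomalies count toward M
        weights = W1 if rng.random() < 0.5 else W2   # assumption: equal odds
        groups.append(sample_group(rng, weights, rng.poisson(100)))
        labels.append("normal")
    # One point-based anomaly: all points drawn from N((0, 0), I).
    groups.append(rng.normal(size=(rng.poisson(100), 2)))
    labels.append("point")
    # Two distribution-based anomalies: normal components, unusual weights.
    for weights in ANOMALOUS_WEIGHTS:
        groups.append(sample_group(rng, weights, rng.poisson(100)))
        labels.append("distribution")
    return groups, labels

groups, labels = make_dataset(np.random.default_rng(0))
```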

The  detection  results  of  MGM,  FGM,  NGM,  as  well  as  LDA  are  shown  in  Fig.  4.5.  For  LDA, 
we use FGM’s distribution-based score (4.20). We show 12 out of the 100 groups. Normal groups are
surrounded  by  black  solid  boxes,  point-based  anomalies  have  green  dashed  boxes,  and  distribution- 
based  anomalies  have  red  dashed  boxes.  Points  are  colored  by  the  anomaly  scores  of  the  groups 
(darker  color  means  more  anomalous).  An  ideal  detector  would  make  dashed  boxes’  points  dark 
and  solid  boxes’  points  light  gray.  The  method  postfix  “-D”  means  distribution-based  scores,  and 
“-P”  means  point-based  scores. 

We can see that the genre models can all find the distribution-based anomalies, since they are able to learn the complex distribution of the topic weights. LDA, in contrast, lacks the flexibility to capture the simple yet multi-modal distribution of the topic weights in this data set. When we merge the results of the point-based and distribution-based scores, all the injected group anomalies are found by the genre models. We notice that MGM is less sensitive to the point-based anomaly. The explanation is simple: the anomalous points are distributed in the middle of the topics, so the inferred topic weight is around $[0.33, 0.33, 0.33]$, which is exactly $w_1$. As a result, MGM infers this group to be normal, although it is not. This example shows one possible problem of scoring groups based on topic weights only. On the other hand, with adaptive topics, FGM and NGM managed to identify the point-based anomaly even with the distribution-based scoring function. We also observed that the MGM-P score is slightly noisier than FGM-P and NGM-P, probably because MGM scores individual points while FGM and NGM score topics, which are more stable.

Figures 4.6a-4.6c show the density estimates produced by LDA, MGM, and FGM, respectively, for the point-based anomalous group. We can see that FGM gives a better estimate thanks to its adaptive topics, while LDA and MGM are limited to their global topics. Figure 4.6d shows the learned genres visualized as the distribution $\sum_t p_t\, \mathrm{Dir}(\cdot \mid \alpha_t)$ on the topic simplex. This distribution summarizes the normal topic weights in this data set. Observe that the two peaks in the probability simplex are indeed very close to $w_1$ and $w_2$.

4.8.2  Image  Data 

In this experiment we test the performance of the genre models in detecting anomalous scene images. We use the OT data set from [126], which contains 8 outdoor scene categories. There are 2,688 images in total, each having about 256 × 256 pixels. A more detailed description of this data set can be found in Section 5.6.3.

We use the first 100 images from each category in our experiments. The images are represented as in [50]: we treat each image as a group of local patches. We densely sample about 400 patches on a regular grid from each image, extract the 128-dimensional SIFT [106] feature vector from each patch, and then reduce its dimension to 10 using PCA.

Figure 4.5: Detection results on the synthetic data. Black boxes are normal groups. Green dashed boxes are point-based anomalies. Red dashed boxes are distribution-based anomalies. The method postfix “-D” means distribution-based scores, and “-P” means point-based scores.


Figure 4.6: (a), (b), (c) show the density of the point-based anomaly estimated by LDA, MGM, and FGM respectively. In LDA and MGM, topics must be shared globally, therefore they perform badly. (d) The genres in the synthetic data set learned by FGM.


In addition to the genre models, we also test several other simple detectors for comparison. A Gaussian mixture model (GMM) based method is implemented to detect point-based anomalies. This method flattens the groups and fits a GMM to all the training data points. Then it uses the likelihood of the points in the test groups under the GMM to compute their anomaly scores, and finally scores a group by averaging its points’ scores. In other words, the GMM method finds the groups with the most points in low-density regions. To be able to detect distribution-based anomalies, we also implemented another competitor called LDA-KNN. LDA-KNN uses LDA to estimate the topic weights in the groups and treats these topic weights (parameter vectors of the multinomials) as the groups’ features. Then, a KNN-based point anomaly detector [182] is used to score the groups’ feature vectors. Finally, we examine an adaptation of the Theme Model (ThM) [50]. The original ThM handles only discrete data and was proposed for clustering. To handle continuous data, we modified ThM to use Gaussian topics. Essentially, ThM is a simplified version of FGM without the adaptive topics. We can then apply the scoring function (4.20) to find distribution-based anomalies, and the data likelihood to find point-based anomalies. The following list summarizes the detectors:

•  GMM-P:  point-based  detector  using  GMM.

•  MGM-P:  point-based  detector  using  the  MGM  likelihood. 

•  MGM-PP:  point-based  detector  using  MGM  with  the  pseudo-prior  scorer  (4.31). 

•  ThM-P:  point-based  detector  using  the  ThM  likelihood. 

•  NGM-P:  point-based  detector  using  NGM  with  the  pseudo-prior  scorer  (4.31). 

•  FGM-P:  point-based  detector  using  FGM  with  the  scorer  (4.19). 

•  LDA-KNN:  distribution-based  detector  using  KNN  and  the  topic  weights  learned  by  LDA. 

•  MGM-D:  distribution-based  detector  using  MGM  with  scorer  (4.12). 

•  ThM-D:  distribution-based  detector  using  ThM  with  scorer  (4.20). 

•  NGM-D:  distribution-based  detector  using  NGM  with  scorer  (4.12). 

•  FGM-D:  distribution-based  detector  using  FGM  with  scorer  (4.20). 
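
The GMM-based point detector described before the list can be sketched as follows. This is only a minimal illustration, assuming a one-dimensional mixture whose parameters have already been fitted; the function names and the toy parameters are hypothetical, and the sketch uses the average negative log-likelihood (an assumption) so that larger scores mean more anomalous groups.

```python
import math

def gmm_log_density(x, weights, means, stds):
    """Log density of a 1-D Gaussian mixture at x (log-sum-exp for stability)."""
    logs = [
        math.log(w) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def group_anomaly_score(group, weights, means, stds):
    """Average negative log-likelihood of the points in one group."""
    total = sum(gmm_log_density(x, weights, means, stds) for x in group)
    return -total / len(group)

# Hypothetical mixture standing in for a GMM fitted to the flattened training points.
weights, means, stds = [0.5, 0.5], [-2.0, 2.0], [0.5, 0.5]

normal_group = [-2.1, -1.9, 2.0, 2.2]
odd_group = [-2.0, 2.1, 0.0, 0.1]  # two points fall in the low-density gap

scores = [group_anomaly_score(g, weights, means, stds)
          for g in (normal_group, odd_group)]
```

In this toy setting the second group receives the higher score because half of its points lie between the two mixture components.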

For all the models we used K = 8 topics and T = 6 genres as suggested by BIC searches. For FGM, we set $\kappa_0 = \nu_0 = N$, where N is the average size of the groups (hence the topic generators in FGM have low variances). For NGM, we set S' = 3 so that each topic generator contains 3 possible
elements.  The  performance  is  measured  by  the  area  under  the  ROC  curve  (AUC)  of  retrieving  the 
anomalies  from  the  test  set. 
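
As a small illustration of the evaluation measure, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal item. The function name and the toy scores below are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve; labels are 1 for anomalies and 0 for normal items.

    Counts the fraction of (anomaly, normal) pairs that are ranked correctly,
    with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy anomaly scores: three anomalies followed by three normal items.
demo_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
demo_labels = [1, 1, 1, 0, 0, 0]
demo_auc = auc(demo_scores, demo_labels)
```

The pairwise loop is quadratic in the number of test items, which is acceptable for test sets of the size used here; a sort-based version would be preferable for large sets.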


Finding  Out-of-Category  Anomalies 

The first type of image anomaly we test is the out-of-category anomaly. In each run, we randomly
select  one  category  as  the  normal  class  and  use  its  images  to  train  the  genre  models.  At  test  time, 
we  mix  images  from  another  category  into  some  normal  images  as  anomalies,  and  ask  the  models 
to  find  them.  The  anomalies  in  this  experiment  are  not  controlled  and  can  be  of  any  type.  Note  that 
the training and testing images do not overlap. In each run, we select one normal category and one abnormal category. Then, we use 80% of the images in the normal category for training, and use the remaining 20% combined with the images in the abnormal category for testing.

The  results  of  56  random  runs  are  reported  in  Figure  4.7.  In  general,  point-based  detectors 
did  better  than  the  distribution-based  detectors,  which  is  expected  since  different  scene  images  are 
likely to have distinctive patches. Our methods MGM-PP, NGM-P, and FGM-P did significantly better than the others. Note that the point-based detector MGM-PP performed better than MGM-P, showing the advantage of the pseudo-prior approach. FGM-P performed the best. Though not apparent in the boxplot due to the high variance of the data, the advantage of FGM-P is significant: between FGM-P and the next best NGM-P, the p-value of the Wilcoxon signed rank test is 0.035 (the signed rank test was used because the distribution of the accuracies was highly skewed; for reference, the paired t-test has a p-value of 0.028). On the other hand, several distribution-based
detectors  also  did  well,  but  ThM-D  and  FGM-D  failed  this  task  because  in  this  complex  data  set 
the  estimation  of  Dirichlet  genres  became  unstable.  Finally,  we  observe  that  MGM  and  NGM 
performed  very  similarly. 
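
For readers unfamiliar with the signed rank test mentioned above, the following sketch shows how paired results of two detectors could be compared. It is not the procedure used to obtain the reported p-values: it relies on the normal approximation without tie or continuity corrections, and the function name and the per-run numbers are made up for illustration.

```python
import math

def signed_rank_test(a, b):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    Returns (w_plus, w_minus, p_value). Zero differences are dropped and
    tied absolute differences share their average rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return w_plus, w_minus, p_value

# Hypothetical per-run AUCs (in percent) of two detectors on the same runs.
detector_a = [91, 85, 78, 88, 95, 70, 83, 90]
detector_b = [89, 80, 79, 84, 90, 66, 80, 82]
w_plus, w_minus, p_value = signed_rank_test(detector_a, detector_b)
```

With only a handful of runs an exact test would be more appropriate than the normal approximation; the sketch is meant to show the ranking of paired differences, which is what makes the test insensitive to skewed accuracy distributions.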


Figure  4.7:  Performances  on  detecting  out-of-category  images.  See  text  for  details. 


Finding  Stitched  Images 

The second type of image anomaly is the stitched image. The purpose here is to find unnatural
images.  In  each  run,  we  select  two  categories  as  the  normal  classes,  and  then  divide  the  images  in 
these  two  classes  into  training  and  testing  sets.  We  create  anomalies  by  stitching  random  pairs  of 
images  (horizontally  half-by-half)  from  different  categories  in  the  testing  set.  The  stitched  images 
are  then  added  to  the  testing  set,  and  the  goal  is  to  find  these  unnatural  synthesized  images.  Note 
that  unlike  the  previous  experiment,  the  anomalies  here  are  controlled;  the  normal  test  images  and 
the  anomalies  consist  of  exactly  the  same  patches,  and  none  of  them  overlap  with  the  training 
images.  For  instance,  an  anomaly  may  be  a  picture  that  is  half  mountain  and  half  city  street.  Some 
examples  are  shown  in  Figure  4.8.  When  extracting  the  SIFT  features,  points  near  the  stitching 
boundaries  are  discarded  to  avoid  boundary  artifacts. 
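
The construction of a stitched anomaly can be sketched as below. The sketch assumes that “horizontally half-by-half” means taking the left half of one image and the right half of the other, and it represents images as nested lists; the function names and the margin parameter are hypothetical.

```python
def stitch_half_by_half(left_img, right_img):
    """Combine the left half of left_img with the right half of right_img.

    Images are lists of rows (lists of pixels) with identical shapes.
    """
    if len(left_img) != len(right_img) or any(
        len(a) != len(b) for a, b in zip(left_img, right_img)
    ):
        raise ValueError("images must have the same shape")
    half = len(left_img[0]) // 2
    return [row_l[:half] + row_r[half:] for row_l, row_r in zip(left_img, right_img)]

def keep_patch(center_col, width, margin):
    """True if a patch centered at center_col is at least `margin` columns from the seam."""
    return abs(center_col - width // 2) >= margin

# Toy 2 x 4 "images" from two different categories.
mountain = [["M"] * 4 for _ in range(2)]
street = [["S"] * 4 for _ in range(2)]
stitched = stitch_half_by_half(mountain, street)
```

The second helper mirrors the remark that points near the stitching boundary are discarded, so that the detector cannot simply key on seam artifacts.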

Figure 4.8: Image samples. Green boxes (first row) contain natural images, and yellow boxes (second row) contain stitched anomalies.


We use the same data as in the previous experiment. In each run, we randomly select two categories and use 80% of the images in both categories for training. The remaining 20%, combined with the synthesized stitched images, are used for testing. The number of normal testing images and the number of anomalies are equal.

The performances from 56 random runs are shown in Figure 4.9. As expected, in contrast to the previous experiment, the distribution-based methods are much better than the point-based methods, since by construction there are no point-based anomalies. In particular, the methods that are based on Dirichlet genres, including ThM-D and FGM-D, lead the performance by a large margin. The difference between ThM-D and FGM-D is negligible, meaning that the adaptive topics of FGM had little use since there are no point anomalies. Again, MGM and NGM performed similarly.

Figure 4.9: Performances on detecting stitched images.

4.8.3 Turbulence Data


We present an exploratory study of detecting group anomalies on turbulence data from the JHU Turbulence Database Cluster¹ (TDC) [127]. TDC simulates fluid motion through time on a 3-dimensional grid, and here we perform our experiment on a continuous $128^3$ sub-grid. In each time step and each vertex of the grid, TDC records the 3-dimensional velocity of the fluid. We consider the vertices in a local cubic region as a group, and the goal is to find groups of vertices whose velocity distributions (i.e., moving patterns) are unusual and potentially interesting. The following steps were used to extract the groups: (1) We chose the $\{(8i, 8j, 8k)\}_{i,j,k}$ grid points as centers of our groups. Around these centers, the points in $7^3$-sized cubes formed our groups. (2) The feature of a point in the cube was its velocity relative to the velocity at its cube's center point. After these pre-processing steps, we had M = 4,096 groups, each of which had 342 3-dimensional feature vectors.
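
The group extraction can be sketched as follows. The counts in the text are consistent with $(128/8)^3 = 4{,}096$ centers and $7^3 - 1 = 342$ points per group, which suggests that the center point itself (whose relative velocity is zero) is excluded; this exclusion, the periodic wrap-around at the boundary, and the toy velocity field are all assumptions of the sketch, not details taken from the experiment.

```python
from itertools import product

def extract_groups(velocity, n, stride=8, half=3):
    """Build one group per center on an n^3 grid.

    velocity: function (i, j, k) -> (vx, vy, vz).
    Each group holds the velocities of the cube's points relative to the center.
    """
    offsets = [o for o in product(range(-half, half + 1), repeat=3) if o != (0, 0, 0)]
    groups = []
    for ci, cj, ck in product(range(0, n, stride), repeat=3):
        center_v = velocity(ci, cj, ck)
        group = []
        for di, dj, dk in offsets:
            # Periodic wrap-around at the boundary is an assumption of this sketch.
            v = velocity((ci + di) % n, (cj + dj) % n, (ck + dk) % n)
            group.append(tuple(a - b for a, b in zip(v, center_v)))
        groups.append(group)
    return groups

def toy_velocity(i, j, k):
    """Arbitrary smooth field used only to exercise the code."""
    return (float(i), float(j + k), float(i * k))

# A 16^3 toy grid keeps the example fast; the experiment uses n = 128.
toy_groups = extract_groups(toy_velocity, n=16)
```

On the toy grid this yields $2^3 = 8$ groups of 342 three-dimensional vectors each; with n = 128 the same loop would produce the 4,096 groups described above.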

We  applied  MGM-D,  ThM-D,  and  FGM-D  to  find  anomalies  in  this  group  data.  T  =  4  genres 
and K = 6 topics were used for all methods. We do not have a ground truth for anomalies in this data set. However, we can compute the “vorticity score” [115] for each vertex that indicates the tendency of the fluid to “spin”. Vortices and especially their interactions are uncommon and of great interest in the field of fluid dynamics. This vorticity can be considered as a hand-crafted anomaly score based on expert knowledge of this fluid data. We do not want an anomaly detector
to  match  this  score  perfectly  because  there  are  other  “non-vortex”  anomalous  events  it  should  find 
as  well.  However,  we  do  think  higher  correlation  with  this  score  indicates  better  anomaly  detection 
performance. 

Figure  4.10  visualizes  the  anomaly  scores  of  FGM  and  the  vorticity.  We  can  see  that  these 
pictures  are  highly  correlated,  which  implies  that  FGM  was  able  to  find  interesting  turbulence 
activities  based  on  velocity  only  and  without  using  the  definition  of  vorticity  or  any  other  expert 
knowledge.  Correlation  values  between  vorticity  and  the  MGM,  ThM,  and  FGM  scores  from  20 
random  runs  are  displayed  in  Fig.  4.10c,  showing  that  FGM  is  better  at  finding  regions  with  high 
vorticity. 


4.9  Summary 

We presented a parametric, generative approach to model collective data, and used it for the group anomaly detection problem. Using topic modeling techniques, we proposed the multinomial genre models (MGM), the flexible genre models (FGM), and the nonparametric genre models (NGM) that are able to capture complex group behaviors at multiple levels, while progressively achieving a better balance between model flexibility and learning efficiency. Several scoring functions are also proposed specifically to exploit the capability of the models to detect group anomalies. Empirical results show that genre models can model the generating process of the collective data and detect various group anomalies well. However, since the anomalies vary a lot depending on the data sets, we need to choose the best model and scoring function in order to achieve the best results.

1  http://turbulence.pha.jhu.edu 

In the future, we would like to make the genre models robust. So far we have used the genre models to find outliers in the testing set. However, when the training set is contaminated by outliers, the learned model might be distorted towards the outliers. Therefore, we want the model to learn only normal behaviors even if it was trained on a data set that contains a few outliers. An initial attempt has been made in this direction using long-tail distributions (e.g., the Student's t distribution) as the model parameters, yet the resulting model is overly complex, involving many free parameters, and unstable in practice. We shall continue to investigate more reliable approaches. Using techniques from Gaussian processes [137], we can also extend the genre models to functional observations where each point is a noisy observation of a function.

Figure 4.10: Detection results for the turbulence data. (a) & (b) FGM-DB anomaly score and vorticity visualized on one slice of the cube. (c) Correlations of the anomaly scores with the vorticity.

MLE of the Gaussian-Inverse-Wishart Distribution


We present the MLE for a Gaussian-Inverse-Wishart (GIW) distribution $GIW(\mu_0, \kappa_0, \Psi_0, \nu_0)$ given a set of Gaussian distributions with parameters $\beta = \{\beta_m = (\mu_m, \Sigma_m)\}_{m=1,\dots,M}$. By iteratively updating the parameters, we converge to a stationary point of the likelihood function $L_{GIW}(\beta; \mu_0, \kappa_0, \Psi_0, \nu_0)$. For $\mu_0$, $\kappa_0$, and $\Psi_0$, we can derive direct solutions by setting the partial derivatives of the log-likelihood to zero:


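$$\mu_0 = \Big(\sum_{m=1}^{M} \Sigma_m^{-1}\Big)^{-1} \sum_{m=1}^{M} \Sigma_m^{-1} \mu_m, \qquad (4.34)$$

$$\kappa_0 = \frac{MD}{\sum_{m=1}^{M} (\mu_m - \mu_0)^\top \Sigma_m^{-1} (\mu_m - \mu_0)}, \qquad (4.35)$$

$$\Psi_0 = \nu_0 \Big(\frac{1}{M} \sum_{m=1}^{M} \Sigma_m^{-1}\Big)^{-1}, \qquad (4.36)$$

where D denotes the feature dimension.

The partial derivative w.r.t. $\nu_0$ is given as follows:

$$\frac{\partial L}{\partial \nu_0} = \frac{1}{2} \Big[ M \log|\Psi_0| - M D \log 2 - M \psi_D(\nu_0/2) - \sum_{m=1}^{M} \log|\Sigma_m| \Big], \qquad (4.37)$$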

where $\psi_D(\cdot) = \Gamma_D'(\cdot)/\Gamma_D(\cdot)$ stands for the first order derivative of the multivariate log-gamma function. $\nu_0$ does not have an analytical solution. To address this issue, observe that $\nu_0$ is a scalar, and $\Psi_0$ has a simple linear dependency on $\nu_0$. Thus, we can apply a one dimensional search to find the optimal $\nu_0$, e.g., using numerical differentiation. Having $\nu_0$ we can compute $\Psi_0$.
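
The one dimensional search can be illustrated in the scalar case D = 1, where each $\Sigma_m$ is a variance. The sketch below assumes the standard inverse-Wishart parameterization (for D = 1 an inverse-gamma density with shape $\nu_0/2$ and scale $\Psi_0/2$), substitutes the linear relation between $\Psi_0$ and $\nu_0$ into the log-likelihood, and maximizes the result by golden-section search instead of numerical differentiation. The function names, search interval, and synthetic data are hypothetical.

```python
import math
import random

def profile_loglik(nu, n, mean_inv, sum_log):
    """nu-dependent part of the log-likelihood after substituting Psi_0(nu)."""
    psi = nu / mean_inv  # scalar case: Psi_0 = nu_0 / ((1/M) * sum_m 1/Sigma_m)
    return (n * (0.5 * nu * math.log(0.5 * psi) - math.lgamma(0.5 * nu))
            - (0.5 * nu + 1.0) * sum_log - 0.5 * n * nu)

def fit_nu0(variances, lo=1e-3, hi=1e3, iters=200):
    """Golden-section search for nu_0; returns (nu_0, Psi_0)."""
    n = len(variances)
    mean_inv = sum(1.0 / s for s in variances) / n
    sum_log = sum(math.log(s) for s in variances)

    def f(nu):
        return profile_loglik(nu, n, mean_inv, sum_log)

    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:   # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = f(d)
        else:         # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = f(c)
    nu = 0.5 * (a + b)
    return nu, nu / mean_inv

# Synthetic variances drawn from an inverse-gamma with known parameters.
random.seed(0)
true_nu, true_psi = 6.0, 4.0
variances = [1.0 / random.gammavariate(true_nu / 2.0, 2.0 / true_psi)
             for _ in range(5000)]
nu_hat, psi_hat = fit_nu0(variances)
```

On the synthetic data the recovered values should lie close to the generating parameters. For D > 1 the same search applies with determinants and the multivariate log-gamma function in place of the scalar terms.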

Chapter 5

Discriminative  Methods  for  Collective  Data 


We  introduce  a  new  discriminative  learning  method  for  classification  on  collective  data.  Unlike 
generative models, discriminative methods try to learn target concepts directly regardless of the
generating  mechanism  of  data.  In  this  chapter  we  describe  discriminative  ways  of  learning  from 
collective  data  based  on  similarity  or  dissimilarity  measures  between  groups.  The  advantage  of 
this  approach  is  its  great  flexibility,  since  we  can  take  advantage  of  the  existing  tools  that  rely 
on  similarities  to  accomplish  a  vast  variety  of  tasks  on  groups.  We  use  consistent  nonparametric 
divergence  estimators  to  define  new  kernels  over  the  groups/sets,  and  then  apply  them  in  kernel 
classifiers.  Our  results  on  image  classification  demonstrate  that  in  many  cases  this  approach  can 
outperform  state-of-the-art  competitors  on  both  simulated  and  challenging  real-world  datasets. 


5.1  Introduction 

We propose new methods for the classification of distributions. In the classification problem our goal is to find a map from the space of distributions to the space of classes, while in the anomaly detection problem we want to find distributions that are unlike others. Note that only finite i.i.d. samples are observed from these distributions. For this purpose we extend support vector machines (SVM) to the space of distributions. In our framework, some of the distributions in the training data will play the role of support vectors.

We consider this problem in the context of image classification. There are numerous examples in computer vision where images are represented by unordered sets of feature vectors. For example, the shapes of an object can be represented by sets of local descriptors at edges and corner points [63]. Human faces can also be described by sets of local image patches containing certain facial parts. The SIFT [106], HOG [36], and PHOG [8] feature extractors find stable image representations by detecting sets of local-affine invariant regions and other regions of interest.

To compare images represented by feature sets, a straightforward approach is to treat the sets as if they contained instances sampled from an unknown and possibly high-dimensional distribution. A common way to handle these distributions is to use (high-dimensional) histograms through discretization, and compare these histograms. The popular “Bag-of-words” (BoW) algorithms use this approach: they treat each image as a set of visual words, where the words are obtained by clustering local image patches [50, 97].

Histogram-based representations have been used in many state-of-the-art computer vision algorithms. However, they have some obvious limitations. When we discretize continuous distributions into bins, we might lose valuable information. This problem is especially severe in high dimensions, where the curse of dimensionality makes histogram-based density estimators unreliable. Selecting the bin sizes (or number of bins) for the histograms is also a difficult model selection problem.

In  this  chapter  we  propose  new  classification  algorithms  that  operate  directly  on  the  set-of- 
vectors  representation  of  the  images.  We  assume  that  the  elements  of  these  sets  are  i.i.d.  sample 
points  from  unknown  distributions  that  characterize  the  images.  In  order  to  classify  the  images, 
we  classify  these  distributions  based  on  their  i.i.d.  sample  set  representations.  The  kernel-based 
approach  is  adopted:  we  introduce  and  estimate  the  kernel  functions  between  these  distributions. 
Having  the  estimated  kernel  matrix,  we  then  apply  kernel  classifiers  such  as  SVM  for  classification. 
The  proposed  kernels  avoid  the  traditional  clustering,  quantization,  or  histogram  building  steps  that 
could  lead  to  loss  of  information. 

These kernel functions on sets will be defined in terms of divergences/distances, just as the Euclidean distance is used to define Gaussian/RBF kernels on vectors. To this end, we will need to estimate the divergences between distributions. A straightforward approach would be to estimate the underlying densities and plug them into the corresponding divergence formulae. In fact, histogram and BoW approaches follow this paradigm. Density estimation, however, is among the most difficult problems in statistics due to the curse of dimensionality. To avoid this problem, we develop our kernels based on a direct (no density estimation required) and nonparametric (minimal assumptions about the true distributions) approach. We show how to estimate a large family of divergences that includes the Rényi, Tsallis, Hellinger, Bhattacharyya, KL, L2, and many other divergences. The estimator is provably consistent, nonparametric, and does not use histograms, kernel density estimators (KDE), or any other density estimators. It depends only on simple k-nearest neighbor (KNN) statistics.

We evaluate the empirical performance of the proposed kernels on both simulated and real-world datasets, and compare them to alternatives based on density estimation or parametric approximations. We show that our kernels achieve performances that match or beat the state of the art in several image classification tasks.

The chapter is organized as follows. In the next section we review some related work. We formally introduce the distribution classification problem and show how to define kernels on distributions in Section 5.3. Sections 5.4 and 5.5 describe how to estimate the kernels on distributions when the densities are unknown. Section 5.6 presents the results of numerical experiments. We conclude with a discussion in Section 5.7.


5.2  Related  Work 

Although  several  methods  exist  to  measure  the  distance  between  sample  sets,  and  kernels  have  also 
been  defined  on  sets,  all  of  these  previous  methods  have  their  shortcomings.  We  will  now  review 
the  most  popular  methods. 

Nguyen et al. recently proposed a method for f-divergence estimation using its so-called “variational characterization properties” [125]. This approach involves an intractable optimization over an infinite-dimensional function space. When this function space is chosen to be a reproducing kernel Hilbert space (RKHS), this optimization problem reduces to an N-dimensional convex problem, where N is the sample size. This can be very demanding in practice for only a few thousand sample points, which is quite common in computer vision applications.

There  are  RKHS  based  approaches  for  defining  kernels  on  unordered  sets  as  well.  The  method 
proposed by Smola et al. [155] uses the interaction between pairs in the sample set, and hence its computation time is $O(m^2)$. The divergence estimator we propose, by contrast, uses only KNN distances in the sample set, a well-studied problem with efficient solutions such as k-d trees. Note
also  that  choosing  an  appropriate  kernel  function  for  the  RKHS  can  be  a  difficult  model  selection 
problem,  a  challenge  not  faced  by  our  proposed  divergence  estimator. 

Sricharan et al. [157] developed k-nearest-neighbor based methods similar to our method for estimating non-linear functionals of the density, of which divergences are a special case. In contrast to our approach, however, their method requires k to increase with the sample size N and diverge to infinity. KNN computations for large k values can be very computationally demanding. In our approach we fix k at a small number (typically between 1 and 5), and are still able to prove that the divergence estimator is consistent.

Jebara and Kondor [76] have also studied the question of how to define kernels on distributions. Their approach fits a parametric family (e.g., exponential family) density to each set of points, and then uses these fitted parameters to estimate the inner products between the densities. Moreno et al. [119] also fit a parametric density to the data and use it to define a KL divergence-based kernel. Parametric approaches can work better than nonparametric methods when the sample size N is small, or if we know from prior knowledge that the true densities belong to these parametric families. When the assumptions do not hold, however, parametric methods introduce bias in estimating the inner products between densities. In contrast, our proposed method is completely nonparametric and provides provably asymptotically unbiased kernel estimations for certain kernels.

Kondor and Jebara [83] earlier introduced a kernel between distributions defined as Bhattacharyya's measure of affinity between finite dimensional Gaussians in a Hilbert space. This
approach  fits  a  Gaussian  distribution  to  the  features  in  a  Hilbert  space,  but  it  can  lead  to  a  large 
bias  when  the  data  in  the  Hilbert  spaces  is  not  Gaussian.  Furthermore,  the  approach  is  developed 
only  for  Bhattacharyya’s  measure.  Our  proposed  method  is  asymptotically  unbiased  and  can  be 
used  for  many  other  divergences. 

The Pyramid Matching Kernel [63], which also operates over unordered sets, has recently become popular in computer vision. In this approach each feature set is mapped to a multi-resolution histogram. These histogram pyramids are compared using a so-called “weighted histogram intersection computation.” A shortcoming of this approach is that it needs to calculate D-dimensional histograms, which can become very inefficient for large D due to the curse of dimensionality. Selecting appropriate bin sizes is also a difficult problem for which only heuristics are known [150].

Póczos et al. [130] used a slightly less general version of our nonparametric divergence estimator to solve certain machine learning problems in the space of distributions. That work studied only simple KNN based classifiers, however. Here we use kernel methods that are more discriminative in classification tasks, and evaluate their performance on various image datasets.


5.3  Problem  Definition 

In this section we formally define our set classification problem and show how kernel classifiers can be generalized to sample sets of distributions. Assume we have M inputs $\{G_1, \dots, G_M\}$, each representing one image, where the mth input $G_m$ contains i.i.d. samples from some underlying density $f_m$. That is, $G_m$ is a set of sample points, and $x_{mj} \sim f_m$ for $j = 1, \dots, N_m$. Let $\mathcal{G}$ denote the set of all such sample sets, i.e., $G_m \in \mathcal{G}$, $m = 1, \dots, M$.

Further assume we are given M labels for these inputs, $\{(G_m, y_m)\}_{m=1}^{M}$. Here $y_m \in \mathcal{Y} = \{y_1, \dots, y_c\}$ denotes the class label of the mth set. We seek a function $h : \mathcal{G} \to \mathcal{Y}$ such that for a new input and output pair $(G, y) \in \mathcal{G} \times \mathcal{Y}$ we ideally have $h(G) = y$. For simplicity, we discuss only binary classification. The ideas below can be extended to c-class classification in the standard ways.

SVM is one of the most successful methods for estimating such functions. In order to use SVM, we need to be able to evaluate the kernel between the inputs. In our case, we need a kernel function on $\mathcal{G} \times \mathcal{G}$ that returns real values. Once we have evaluated such kernels and obtained the kernel matrix, a.k.a. the Gram matrix, existing SVM algorithms can be used for classification. Having the kernel matrix, we can also accomplish many other learning tasks on sets, such as clustering using spectral clustering [122], dimensionality reduction using kernel PCA [147], anomaly detection using one-class SVM [148], and so on. All of these motivate us to find a good kernel matrix for the groups.

5.4  Nonparametric  Kernel  Estimation 

Having two finite i.i.d. sample sets from densities $f_1$ and $f_2$, we need to estimate $k(f_1, f_2)$, the kernel value between them. Many kernels, i.e., positive semi-definite (PSD) functionals of $f_1$ and $f_2$, can be constructed from

$$D_{\alpha,\beta}(f_1 \| f_2) = \int f_1^{\alpha}(x) f_2^{\beta}(x) f_1(x)\, dx, \qquad (5.1)$$

where $\alpha, \beta \in \mathbb{R}$. For example, we can use Eq. (5.1) to construct linear ($k(f_1, f_2) = \int f_1 f_2$), polynomial ($k(f_1, f_2) = (\int f_1 f_2 + c)^s$), and Gaussian ($k(f_1, f_2) = \exp(-\tfrac{1}{2}\mu^2(f_1, f_2)/\sigma^2)$, $\mu^2(f_1, f_2) = \int f_1^2 + f_2^2 - 2 f_1 f_2$) kernels.
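
To make these kernel definitions concrete, the sketch below evaluates them in closed form for one-dimensional Gaussian densities, for which $\int f_1 f_2$ is itself a Gaussian density evaluated at the difference of the means. It assumes the Gaussian kernel has the form $\exp(-\mu^2/(2\sigma^2))$ with $\mu^2 = \int (f_1 - f_2)^2$; the function names and the default values of c, s, and $\sigma$ are hypothetical, and in the actual method these integrals are estimated from samples rather than computed analytically.

```python
import math

def overlap(m1, v1, m2, v2):
    """Integral of the product of N(m1, v1) and N(m2, v2) in one dimension."""
    v = v1 + v2
    return math.exp(-0.5 * (m1 - m2) ** 2 / v) / math.sqrt(2.0 * math.pi * v)

def linear_kernel(f1, f2):
    """k(f1, f2) = integral of f1 * f2; each density is a (mean, variance) pair."""
    return overlap(*f1, *f2)

def polynomial_kernel(f1, f2, c=1.0, s=2):
    """k(f1, f2) = (integral of f1 * f2 + c) ** s."""
    return (overlap(*f1, *f2) + c) ** s

def gaussian_kernel(f1, f2, sigma=1.0):
    """k(f1, f2) = exp(-mu^2 / (2 sigma^2)), mu^2 = integral of (f1 - f2)^2."""
    mu2 = overlap(*f1, *f1) + overlap(*f2, *f2) - 2.0 * overlap(*f1, *f2)
    return math.exp(-0.5 * mu2 / sigma ** 2)

standard = (0.0, 1.0)
shifted = (3.0, 1.0)
k_lin = linear_kernel(standard, standard)
k_same = gaussian_kernel(standard, standard)
k_diff = gaussian_kernel(standard, shifted)
```

For two standard normal densities the linear kernel equals $1/(2\sqrt{\pi})$, and the Gaussian kernel of a density with itself is exactly one, decreasing as the two densities move apart.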

For the Gaussian kernel, which we primarily use in this chapter, one can also use other “distances”. For example, we can use the Hellinger distance with $\mu^2(f_1, f_2) = 1 - \int \sqrt{f_1 f_2}$. Another important family of divergences is the Rényi-α divergence, where

$$R_\alpha(f_1 \| f_2) = \frac{1}{\alpha - 1} \log \int f_1^{\alpha} f_2^{1-\alpha}.$$

Note that the KL-divergence is a special case of the Rényi divergence when $\alpha \to 1$. These divergences are nonnegative and vanish iff $f_1 = f_2$ almost surely. Nonetheless, the divergences are usually not symmetric, do not satisfy the triangle inequality, and do not lead to PSD kernel matrices. In Section 5.5 we will show how to address this problem.

To estimate $D_{\alpha,\beta}(f_1 \| f_2)$ for some α, β values, we use the tools that have been applied for Rényi entropy [96], Shannon entropy [61], KL divergence [168], and Rényi divergence estimation [129]. We show how to estimate $D_{\alpha,\beta}(f_1 \| f_2)$ in an efficient, nonparametric, and consistent way.

Let $G_1 = \{x_1, \dots, x_{N_1}\}$ be an i.i.d. sample from $f_1$, and similarly let $G_2 = \{z_1, \dots, z_{N_2}\}$ be an i.i.d. sample from $f_2$. Let $\rho_k(i)$ denote the Euclidean distance between $x_i$ and its kth nearest neighbor in $G_1$, and similarly let $\nu_k(i)$ denote the distance between $x_i$ and its kth nearest neighbor in $G_2$. Based on [130], we can use the following estimate
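
$$\hat{D}_{\alpha,\beta}(f_1 \| f_2) = \frac{1}{N_1} \sum_{i=1}^{N_1} (N_1 - 1)^{-\alpha} N_2^{-\beta} \rho_k^{-D\alpha}(i)\, \nu_k^{-D\beta}(i)\, B_{k,\alpha,\beta}, \qquad (5.2)$$

where $B_{k,\alpha,\beta} = \bar{c}^{-\alpha-\beta} \frac{\Gamma(k)^2}{\Gamma(k-\alpha)\Gamma(k-\beta)}$, D is the dimension of the sample points, and $\bar{c}$ is the volume of the D-dimensional unit ball. Under certain conditions, we can prove that $\hat{D}_{\alpha,\beta}$ is a consistent estimator of $D_{\alpha,\beta}$, and thus by plugging these estimators into kernels we get consistent estimators for those kernels. This means that the more sample points we have, the better the kernel estimate, which eventually converges to the correct value.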
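
A brute-force version of such a KNN-based estimator might look as follows. The sketch assumes the form of the estimator from the cited work [130]: each term multiplies powers of the two KNN distances, and the constant $B_{k,\alpha,\beta}$ combines the unit-ball volume with gamma functions. The function names are hypothetical, and the neighbor search is a simple quadratic-time scan rather than a tree-based method.

```python
import math
import random

def kth_nn_distance(x, points, k, skip_self=False):
    """Euclidean distance from x to its k-th nearest neighbor in points."""
    dists = sorted(math.dist(x, p) for p in points)
    # If x is itself a member of points, index 0 is its zero self-distance.
    return dists[k] if skip_self else dists[k - 1]

def unit_ball_volume(dim):
    """Volume of the unit ball in dim dimensions."""
    return math.pi ** (dim / 2.0) / math.gamma(dim / 2.0 + 1.0)

def estimate_d_alpha_beta(g1, g2, alpha, beta, k=3):
    """KNN-based estimate of the integral of f1^alpha * f2^beta * f1."""
    n1, n2, dim = len(g1), len(g2), len(g1[0])
    b = (unit_ball_volume(dim) ** (-alpha - beta)
         * math.gamma(k) ** 2 / (math.gamma(k - alpha) * math.gamma(k - beta)))
    total = 0.0
    for x in g1:
        rho = kth_nn_distance(x, g1, k, skip_self=True)  # neighbor within G1
        nu = kth_nn_distance(x, g2, k)                   # neighbor within G2
        total += rho ** (-dim * alpha) * nu ** (-dim * beta)
    return (n1 - 1) ** (-alpha) * n2 ** (-beta) * b * total / n1

# Two samples from the same 2-D standard normal distribution.
random.seed(0)
g1 = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(300)]
g2 = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(300)]

# With alpha = -0.2 and beta = 0.2 the target is the integral of
# f1^0.8 * f2^0.2, which equals 1 when f1 = f2.
same_dist_estimate = estimate_d_alpha_beta(g1, g2, alpha=-0.2, beta=0.2)
trivial_estimate = estimate_d_alpha_beta(g1, g2, alpha=0.0, beta=0.0)
```

The choice $\alpha = -0.2$, $\beta = 0.2$ corresponds to the integral inside a Rényi divergence of order 0.8, so an estimate near one translates into a divergence estimate near zero for two samples from the same distribution.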

To compute the estimate (5.2), all we need are the KNN distances $\rho_k(i)$ and $\nu_k(i)$ for every point $x_i$ in group $G_1$. In low dimensions, nearest neighbors can be found in logarithmic time using tree structures such as the KD-Tree, resulting in a time complexity of $O(N \log N)$ for one pair of groups, where N is around the average size of the groups. In high dimensions, however, efficient search for neighbors becomes difficult and generally we can only examine the points one by one in linear time, resulting in a quadratic time complexity $O(N^2)$. Since we have to compute the estimated kernel for each pair of groups in order to use kernel machines, the overall complexity becomes $O(M^2 N^2)$. As a remedy, we can parallelize the computations of different pairs of groups. Another solution is to use approximate nearest-neighbor search algorithms such as [121]. More discussions and solutions to the efficiency problem can be found in Chapter 7.

This estimator can also work on groups with different sizes and the consistency result still holds. However, since larger sample sizes tend to give more accurate estimates, in theory working with groups of different sizes might give estimates of different qualities in the same kernel matrix. Nevertheless, in practice we found this is usually not a problem. We show empirical results in Section 5.6 on groups of similar sizes as well as groups of very different sizes.


5.5  Constructing  Mercer  Kernels 

Kernels constructed from $D_{\alpha,\beta}$ are not ready to be plugged into kernel machines like SVM. Even though the estimation is consistent, any particular estimated kernel/Gram matrix might not be positive semi-definite (PSD), which is required by SVM. There are two reasons for this problem: 1) the divergences themselves might not be Hilbertian metrics, which is a necessary condition for producing PSD kernels [145]; 2) estimation errors exist given the finite sample size. We therefore
need  to  transform  the  raw  estimated  kernel  matrix  into  a  PSD  matrix  so  that  the  underlying  kernel 
is  a  valid  Mercer  kernel. 
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Here  we  project  the  raw  kernel  matrix  to  the  cone  of  PSD  matrices,  and  use  that  projected 
image,  which  is  a  PSD  matrix,  as  the  input  to  SVM.  In  other  words,  we  are  seeking  for  the  PSD 
matrix  that  can  best  approximate  the  raw  kernel  matrix.  To  do  this,  we  first  symmetrize  the  esti¬ 
mated  kernel  matrix  by  taking  half  the  sum  of  it  and  its  transpose,  and  then  project  it  to  the  cone  of 
PSD  matrices  by  discarding  any  negative  eigenvalues  ( i.e .  setting  them  to  zeros)  from  its  spectrum 

[71] 
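
The two steps above (symmetrize, then clip negative eigenvalues) can be sketched in a few lines of
NumPy. This is a minimal illustration; the function name `project_to_psd` is hypothetical and the
small matrix in the usage note is made up for the example.

```python
import numpy as np

def project_to_psd(K_raw):
    """Project a square matrix onto the cone of symmetric PSD matrices."""
    K_raw = np.asarray(K_raw, dtype=float)
    # Step 1: symmetrize by taking half the sum of the matrix and its transpose.
    K_sym = 0.5 * (K_raw + K_raw.T)
    # Step 2: eigendecompose and set negative eigenvalues to zero.
    eigvals, eigvecs = np.linalg.eigh(K_sym)
    eigvals = np.clip(eigvals, 0.0, None)
    # Reassemble V diag(w) V^T.
    return (eigvecs * eigvals) @ eigvecs.T
```

For example, the asymmetric matrix [[1, 3], [1, 1]] symmetrizes to [[1, 2], [2, 1]], whose
eigenvalues are 3 and -1; dropping the negative one leaves the rank-one matrix with all entries 1.5.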

Rather  than  projecting  the  estimated  kernel  matrix  and  then  solving  an  SVM,  one  can  actually 
combine  these  two  steps  into  a  single  convex  problem  [108].  We  do  not  pursue  this  approach  in 
this  work,  however. 

When  structures  exist  in  the  kernel  matrix  (e.g.  the  kernel  matrix  has  a  low  rank),  we  can  find 
better  ways  to  construct  PSD/Mercer  kernels  based  on  the  raw  estimations.  This  direction  is  further 
studied  in  Chapter  6. 

We could also estimate distribution divergences that are Hilbertian metrics and then use them
to construct kernels. By doing this, we would only have to deal with the estimation errors, and
might obtain higher-quality kernel matrices. [69, 70] proposed a family of Hilbertian metrics between
probability distributions. These metrics can also be estimated using techniques similar to those in
(5.2). The consistency of such a family of estimates is yet to be studied, but several interesting
special cases of this family, notably the Jensen-Shannon divergence, either coincide with the
divergences mentioned in the previous section or can be derived from (5.2) and [168]. We leave this
possibility for future work.


5.6  Experiments 

In  this  section,  we  show  the  empirical  performance  of  the  proposed  kernels  in  both  simulation 
studies  and  real-world  image  classification  tasks.  Code  and  datasets  used  here  are  available  at 
autonlab.org/autonweb/20680.html.

In all these tasks, the objects of interest are represented as “bags of vectors” (BoV), i.e.
unordered sets of feature vectors. The proposed kernel estimators, as well as several other kernels
between sets of points, are used to calculate kernel matrices for these sets. The full kernel
matrices are projected to be symmetric positive semi-definite and given to a multi-class SVM for
classification.

Nonparametric divergence kernels. These kernels are based on the proposed nonparametric
Rényi-α divergence estimators (NPR-α) and Hellinger distance estimators (NPH). We use the
k = 5th nearest neighbors in these estimators, except in Section 5.6.1, where small sample sizes
necessitate k = 1. For NPR, we test the performance with α ∈ {0.5, 0.7, 0.9, 0.99}. Note that
when α = 0.99 the Rényi divergence approximates the KL divergence, and when α = 0.5 it is
twice the Bhattacharyya distance.

Parametric kernels. These kernels are based on a Gaussian or Gaussian mixture model (GMM)
assumption. We first fit the density to each group, and then compute the KL divergence (G-KL,
GMM-KL) [119] and product probability kernels (G-PPK, GMM-PPK) [76] with α = 0.5 between
the groups (therefore they are actually the Bhattacharyya coefficients between Gaussians).
Tuning the number of GMM components for each group is not feasible, so we always use 3
components. GMM-KL has no analytic form, so we use a Monte Carlo approximation with 500
samples.

BoW kernels. To convert BoV to BoW, we quantize the features into “visual words” and then
compute the histogram of words for each group. The chi-square distance between these BoW
histograms is used to construct the Gaussian kernel. The histograms can be further processed by
PLSA [72] and then used in kernels based on the Euclidean distance.

Pyramid matching kernel (PMK). We also use the vocabulary-guided pyramid matching kernel
[64]; this variant performs better for high-dimensional data. We use the authors’ implementation,
libpmk¹, with the suggested parameters.

Mean map kernel (MMK). We also consider the mean map kernel [155], also known as the mean
match kernel [109] in the computer vision community. The MMK between two groups of vectors
$G_1 = \{x_1, \dots, x_{N_1}\}$ and $G_2 = \{z_1, \dots, z_{N_2}\}$ is defined as

$$k_{MM}(G_1, G_2) = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} k(x_i, z_j).$$

In other words, the MMK is the average kernel matching score over every pair of points between
the two groups. We let the point-wise matching kernel be the Gaussian kernel
$k(x, y) = \exp(-\|x - y\|^2 / \sigma^2)$, where the kernel width $\sigma$ is tuned by cross-validation
in the same way as the other parameters. To avoid the high computational cost of the MMK
($O(N_1 N_2)$ for each pair of groups), we randomly choose at most 500 points from each group to
compute the MMK, so that the computation is affordable while the approximation error is small.
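
As a concrete reading of the definition above, the following sketch computes the subsampled MMK
between two groups with NumPy. It is an illustration only: the function name `mean_map_kernel`, the
`seed` argument, and the choice of sampling without replacement are assumptions, not details taken
from the experiments.

```python
import numpy as np

def mean_map_kernel(X, Z, sigma, max_points=500, seed=0):
    """Average Gaussian kernel value over all pairs (x_i, z_j).

    Each group is randomly subsampled to at most `max_points` points
    to bound the O(N1 * N2) cost.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    if len(X) > max_points:
        X = X[rng.choice(len(X), size=max_points, replace=False)]
    if len(Z) > max_points:
        Z = Z[rng.choice(len(Z), size=max_points, replace=False)]
    # Squared Euclidean distance between every x_i and every z_j.
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    # k(x, y) = exp(-||x - y||^2 / sigma^2), averaged over all pairs.
    return float(np.exp(-sq_dists / sigma ** 2).mean())
```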

We use LibSVM’s [27] multi-class SVM for classification. All kernel matrices are projected
to be symmetric PSD as in Section 5.5 before use. The penalty $C$ on points within the margin is
chosen from $\{2^{-9}, 2^{-6}, \dots, 2^{18}\}$. For PPK and PMK, we use their kernel values directly.
For other kernels, we use Gaussian kernels $\exp(-\rho^2 / \sigma^2)$, where $\rho$ is the
divergence/distance. The kernel width $\sigma$ is chosen from
$\sigma_0 \times \{2^{-4}, 2^{-2}, \dots, 2^{10}\}$, where $\sigma_0$ is the mean of the pairwise
divergences. $C$ and (when used) $\sigma$ are chosen through joint 3-fold cross-validation on the
training set.
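
The kernel transform and the two search grids described above can be written out as a short sketch.
The function names are hypothetical, and computing $\sigma_0$ from the off-diagonal entries only is
an assumption made here; the text says only that it is the mean of the pairwise divergences.

```python
import numpy as np

def gaussian_kernel(D, sigma):
    """Turn a matrix of divergences/distances into exp(-rho^2 / sigma^2)."""
    D = np.asarray(D, dtype=float)
    return np.exp(-(D ** 2) / sigma ** 2)

def parameter_grids(D):
    """Grids for C and sigma as described in the text.

    sigma_0 is taken as the mean of the off-diagonal entries of D
    (an assumption of this sketch).
    """
    D = np.asarray(D, dtype=float)
    off_diagonal = D[~np.eye(len(D), dtype=bool)]
    sigma0 = off_diagonal.mean()
    C_grid = 2.0 ** np.arange(-9, 19, 3)               # 2^-9, 2^-6, ..., 2^18
    sigma_grid = sigma0 * 2.0 ** np.arange(-4, 11, 2)  # sigma0 * {2^-4, ..., 2^10}
    return C_grid, sigma_grid
```

This gives 10 candidate values of $C$ and 8 candidate values of $\sigma$, so the joint search
evaluates 80 combinations per cross-validation fold.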

For the image experiments, we extract features as follows unless indicated otherwise. The BoV
representation is based on dense SIFT descriptors. We put a regular 2D grid with step size 10 on
each image and compute SIFT descriptors at each grid node. These descriptors are
128-dimensional. To approximate scale invariance, we usually compute three SIFT descriptors
with bin sizes of {6, 9, 12} pixels at each point. After feature extraction, each image is
represented by a variable number of 128-dimensional feature vectors. Following [19], we can also
include color information in the SIFT features by converting the images to HSV color space and
extracting SIFT features from each color channel separately. SIFT features with the same
location and bin size are then concatenated to construct the more descriptive “color SIFT”
feature with dimensionality 384. Finally, we use PCA to reduce the feature vectors’ dimensionality.
Our implementation uses the PHOW function of the VLFeat package [164] for feature extraction.

¹ people.csail.mit.edu/jjl/libpmk

Figure 5.1: Densities of the two one-dimensional mixtures.


For BoW, these SIFT vectors are quantized by k-means into visual words, for which the
vocabulary size (number of clusters) is 1000 for color images and 500 for grayscale images. The
number of PLSA topics is 25, as in [19]. Following common practice in computer vision, the
visual words are based on the original (uncompressed) feature vectors. Therefore the BoW methods
are not directly comparable to the BoV kernels, as they are based on different features. BoW
loses information in the discretization step, while BoV kernels lose information when the
feature dimension is reduced. We will show that our nonparametric kernels outperform BoW in
most cases, perhaps indicating that less information is lost in PCA than in quantization.

We report kernel matrix construction times using 40 cores of a machine with four 12-core
2.3 GHz Opteron K10.5 processors. In this high-dimensional setting, k-d trees are ineffective, so
we use simple brute-force search. Established techniques for approximate KNN should result in
significant speedups with limited loss of performance. In each case, we estimated the
Hellinger distance and the Rényi-α divergence with 20 values of α: -1, -.5, -.2, .1, .2, .3, ..., .9,
.99, 1.01, 1.1, 1.2, 1.3, 1.4, 1.5, and 2.

5.6.1  Artificial  Gaussian  Mixture  Classification 

We first compare the proposed kernels to others on artificial problems, to demonstrate two
advantages of our kernel: it has relatively few parameters that require fine-tuning, and it is
effective in high-dimensional problems.

Consider  the  problem  of  distinguishing  between  the  two  Gaussian  mixtures  illustrated  in  Fig¬ 
ure  5.1.  The  two  mixtures  each  have  a  standard  normal  distribution  with  mixture  coefficient  jj;  the 
two  classes  are  distinguished  by  the  variance  of  the  other  component,  which  can  be  either  .005  or 
.0005. Our task is to learn a classifier that can distinguish samples of size 30 from these two
mixtures. (Although most feature sets will have substantially more than 30 data points for a
real-world image, having a low number of sample points parallels having a moderate number of sample
points in a high-dimensional space.) Note that this problem is quite difficult, as the expected
number of samples from the distinguishing component is below 3.

Figure 5.2 shows the accuracies from 8 runs of 10-fold cross-validation for several kernels
on a data set consisting of 200 samples from each mixture. The BoW method with codebook
size K is denoted by BoW-K. The Bayes-optimal classifier, which chooses whichever mixture had
the higher likelihood of generating the sample, attains an accuracy of 75%. The
BoW kernel performs at its best only for codebook size 50; smaller and larger sizes both perform
worse, some of them considerably so. In contrast, the proposed NPR and NPH methods perform
well with minimal parameter selection, though it seems the Rényi divergence is better for this
problem than the Hellinger distance.

Figure 5.2: 1D mixture classification accuracies.

We also show that our proposed kernel is capable of scaling up to higher-dimensional problems
with small sample sets. This problem is similar, but the samples are of size 15 in $\mathbb{R}^d$.
The common Gaussian has a covariance matrix with diagonal entries 1 and off-diagonal entries 0.2,
while the distinguishing Gaussian has covariance matrix equal to either $I_d$ or $I_d/2$, where
$I_d$ stands for the d-dimensional identity matrix. Each component has mean zero and mixture
coefficient 1/2. The distributions are more distinguishable in higher dimensions, as the components
overlap less.
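
To make the setup concrete, the sketch below draws groups from the two high-dimensional mixtures
just described. It is an illustration under stated assumptions: the function name `sample_group`, the
dimension d = 8, the random seed, and the choice of 200 groups per class (borrowed from the
one-dimensional experiment) are not specified in the text for this experiment.

```python
import numpy as np

def sample_group(n, d, distinguishing_scale, rng):
    """Draw n points in R^d from a two-component, zero-mean Gaussian mixture.

    Common component: covariance with 1 on the diagonal and 0.2 elsewhere.
    Distinguishing component: covariance distinguishing_scale * I_d.
    Both components have mixture coefficient 1/2.
    """
    common_cov = np.full((d, d), 0.2)
    np.fill_diagonal(common_cov, 1.0)
    distinguishing_cov = distinguishing_scale * np.eye(d)
    zero_mean = np.zeros(d)
    # Pick a component for each point, then select the matching draw.
    from_common = rng.random(n) < 0.5
    x_common = rng.multivariate_normal(zero_mean, common_cov, size=n)
    x_dist = rng.multivariate_normal(zero_mean, distinguishing_cov, size=n)
    return np.where(from_common[:, None], x_common, x_dist)

rng = np.random.default_rng(0)
d = 8  # hypothetical dimension for illustration
class_a = [sample_group(15, d, 1.0, rng) for _ in range(200)]  # covariance I_d
class_b = [sample_group(15, d, 0.5, rng) for _ in range(200)]  # covariance I_d / 2
```

Each element of `class_a` and `class_b` is one group of 15 points, i.e. one object to be classified.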

The results of 16 runs of 10-fold cross-validation for several kernels, as well as that of the
Bayes-optimal classifier, are shown in Figure 5.3. The proposed NPR method outperformed its
competitors in this experiment, and indeed achieved near-optimal results for all d. BoW-500
is the only BoW method shown, but other codebook sizes performed similarly. The dimensionality
at which performance peaked varied with the codebook size, so that e.g. BoW-100 peaked at
dimension 8, and BoW-1000 at 14.

5.6.2  Object  Classification 

In  the  following  sections  we  compare  the  performances  of  various  kernels  on  real-world  image 
datasets.  We  first  examine  object  classification  in  the  ETH-80  [95]  data  set.  This  data  set  contains 
8  categories  of  objects;  each  category  has  10  different  objects,  and  each  object  has  41  images 
from  different  view  angles.  Following  [63],  we  use  a  subset  of  400  images  for  the  experiment, 
selecting  5  images  per  object  that  capture  its  appearance  from  different  angles.  Sample  images  of 
two  objects  are  shown  in  Figure  5.4.  Our  goal  is  to  classify  these  objects  into  the  8  categories. 

For this data set, we extract the color SIFT features with bin size fixed at 6 pixels, as scale
invariance is not necessary for this problem. We then reduce the SIFT features to 18 dimensions
using PCA, preserving 50% of the variance. Each image is then represented by 576 18-dimensional
points. Constructing our proposed kernels took 47 seconds.

Figure 5.3: Mean and standard deviation of accuracies on the high-dimensional artificial data set.

We report the performance of 16 random runs of 2-fold cross-validation in Figure 5.5. We
can see that our Rényi-divergence kernels perform better than BoW, and much better than the
other methods. We note that BoW achieved impressive results only when properly tuned, as in
the simulation study of Section 5.6.1. The improvement of NPR-0.9 (mean accuracy 90.9%) over
BoW (88.3%) is statistically significant: a paired t-test gives a p-value below $10^{-3}$. It is also
interesting to see that GMM-based methods perform worse than simple Gaussian-based methods.
This may be because it is harder to choose the parameters of a GMM, or because divergences
between GMMs could not be obtained precisely; both of these problems are infeasible to remedy.
PMK is not very accurate here, though it is fast to compute.

Figure 5.6 shows the performance of the Rényi-α kernel for many values of α, along with
the Hellinger performance for context. The best α values are clearly near 1, i.e. near the KL
divergence, though performance seems to degrade faster for α above 1 than for α below 1.

5.6.3  Scene  Classification 

Scene classification using BoV/BoW representations is a well-studied problem for which many
methods have been proposed (e.g. [19, 50, 135]). Here we test the performance of our
nonparametric kernels against state-of-the-art methods.

We use the OT data set from [126], which contains 8 outdoor scene categories: coast, forest,
highway, inside city, mountain, open country, street, and tall building. There are 2688 images in
total, each about 256 x 256 pixels. Sample images are shown in Figure 5.7. The goal is to classify
test images into one of the 8 categories.

Figure 5.4: Images of two objects from the ETH-80 data set. Each object has 5 different views.

Figure 5.5: Classification accuracies on ETH-80.

We use the color SIFT features, and also append the relative y location of each patch (0
meaning the top of the image and 1 the bottom) to the local feature vectors, so that some
information about object locations in the images can be used in classification. (We chose not to
include x coordinates, because the horizontal locations of objects generally carry little
information in these scene images.) We use bin sizes of {6, 12, 18, 24, 30}. The larger patches are
used so that more global information, such as the co-occurrences of local objects, can be captured.
Using the above features, a typical image contains 1,815 SIFT vectors, each of dimensionality 384;
these are reduced by PCA to 53 dimensions, preserving 70% of the variance, and then the y
coordinates are appended. Finally, each dimension of the feature vectors is normalized to have zero
mean and unit variance. Computing the nonparametric kernels on these larger, higher-dimensional
point sets took 283,599 seconds (about 3 days).

The accuracies of 16 random runs are shown in Figure 5.8. Here results of 10-fold
cross-validation are used so that we can directly compare to other published results. GMM-PPK is
not shown because its accuracy is too low. NPR-0.99 achieved the best average accuracy of 92.11%,
which is much better than BoW’s 90.26%. Notably, this 92.11% accuracy (std dev 0.18%) surpasses the
best previous result of which we are aware, 91.57% [48]. For comparison, in 2-fold cross-validation
the mean accuracies of NPR-0.99 and BoW are 90.85% and 88.21% respectively.

Figure 5.6: Classification accuracies on ETH-80 with Rényi-α for twenty values of α, as well as the
Hellinger distance.

5.6.4  Sport  Event  Classification 

The  BoV  kernels  can  also  be  used  for  visual  event  classification  [100]  in  the  same  manner  as  for 
scene  classification.  We  use  the  data  set  from  [100],  which  contains  Internet  images  of  8  sport  event 
categories:  badminton,  bocce,  croquet,  polo,  rock  climbing,  rowing,  sailing,  and  snowboarding. 
This  data  set  is  considered  more  difficult  than  traditional  scene  classification,  as  it  involves  much 
more  widely  varying  foreground  activity  than  does  e.g.  the  OT  data  set. 

We use the first 130 images from each category, as in [100]. We use color SIFT features
with dimensionality reduced to 57, and add spatial information in the form of the patches’ x and y
coordinates. As image sizes vary, each BoV group contains 295 to 1,542 vectors. Constructing
our proposed kernels took 9,327 seconds (about 2.5 hours).

Figure 5.10 shows the accuracies of 16 random 2-fold cross-validations. We again see the
kernel based on the Rényi-0.9 divergence achieve the best accuracy of 87.1% (std dev 0.4%). This
performance is at the same level as state-of-the-art methods such as [181], which attained 86.7%.
It is worth noting that we used only PCA SIFT without further feature learning, as opposed to
other methods which achieved significant performance increases by learning features. Compared
to previous results, we can see that the performance of the PPK methods decreased; we do not show
GMM-PPK here because its accuracy is too low. The BoW method, though worse than Rényi-0.9
at 83.5%, again performs reasonably well, showing its wide applicability.

Another interesting observation based on all the above results is that the nonparametric
estimates of the Rényi divergences usually perform best when α is close to 1, i.e. when the
divergence is close to the KL divergence. This can be viewed as empirical support for the
theoretical soundness of the KL divergence. On the other hand, in many cases the optimal α is
slightly smaller than 1, showing that the flexibility of the Rényi divergence can be rewarding.

Figure 5.7: Images from the 8 OT scene categories: coast, forest, highway, inside city, mountain,
open country, street, tall building.


5.7  Summary 

In  this  work  we  proposed  a  novel  discriminative  method  for  set  and  distribution  classification.  We 
defined  new  kernels  on  sets  of  vectors  and  used  consistent  nonparametric  divergence  estimators 
for  estimating  the  kernel  values.  Our  goal  was  not  to  introduce  new  features;  instead  we  were 
interested  in  improving  the  performance  of  bag  of  vectors  image  representations  through  better 
dissimilarity  measures. 

Parametric methods for divergence estimation are usually biased, since the true distributions
may not belong to the assumed parametric families. Our nonparametric divergence estimator, however,
is asymptotically unbiased. It is also easy to compute, requiring only certain k-NN distances.

For bag-of-words methods, setting the appropriate codebook size is a difficult model selection
problem. It is similarly unknown how to choose the bin sizes for histogram-based methods. Our
algorithm has comparatively few parameters to tune, and avoids the inherent approximations of
histograms, quantization, and clustering, which can lead to loss of information and decreased
performance.

In our experiments, we demonstrated that the proposed method can outperform its
state-of-the-art competitors on several challenging datasets, both artificial and real.

Figure 5.8: Accuracies on the OT data set. The horizontal line shows the best previously reported
result.


Figure  5.9:  Images  from  the  8  sports. 


Figure 5.10: Accuracies on the Sport data set. The horizontal line shows the best previously
reported result.

Chapter 6

Low-Rank  Constructions  of  Mercer  Kernels 


In this chapter we describe more ways of constructing Mercer kernels based on estimated set
kernels. Unlike the approach used in Chapter 5, which simply seeks the best approximation, the
methods in this chapter further exploit the low-rank structures that might exist in the
divergence/kernel matrices. By using these structures, we are able to construct higher-quality
kernel matrices, make the subsequent SVM training faster, cope with missing or unreliable kernel
values, and make the kernel computation faster.


6.1  Introduction 

Many learning algorithms for collective data are based on pairwise similarities between groups.
The advantage of this approach is evident: once we have the similarities, many excellent
off-the-shelf algorithms such as k-nearest-neighbors, SVM, and spectral clustering can be used to
accomplish various learning tasks. In this chapter, we focus on kernel methods.

In Chapter 5, we described a large class of divergences that can be used to measure the
dissimilarity between groups of points. Traditional set distances such as the Hausdorff distance can
also be used for learning purposes; for a brief survey, refer to Section 5.2. One shortcoming of
many of these divergences is that they do not behave like Euclidean distances: they do not satisfy
the triangle inequality, and they may not even be symmetric. Moreover, these similarities might
have been estimated as in Chapter 5, and are thus contaminated by estimation errors. Consequently,
the kernels (most commonly Gaussian kernels) based on these divergences are not valid Mercer
kernels, and thus cannot be directly applied in many kernel methods such as the SVM.

To address this problem, we replace the raw kernel matrix constructed from the divergences
by a refined kernel matrix that is positive semi-definite (PSD) and thus corresponds to a valid
Mercer kernel. Chapter 5 did this using the PSD projection method, in which the refined kernel
matrix is the best $L_2$ approximation to the raw kernel matrix. That method is based purely on
numerical approximation. In the following, we show that when structure exists in the kernel
matrix, more effective methods can be used to construct the refined kernel matrix.

The structure we exploit in this chapter is that the kernel matrices are often approximately
low-rank. A low-rank matrix can be reconstructed as a linear combination of a few row/column
basis vectors. Another useful interpretation is that the objects can be embedded as points in some
low-dimensional Hilbert space, so that the inner products between the points equal the kernel
values between the objects. In machine learning, the low-rank structure of kernel matrices has been
used to accelerate the training of SVMs, or to enhance the kernels’ discriminative power when
combined with supervision. In our work, we use the low-rank technique to construct valid kernels
from raw estimates. With a low rank, the refined matrix can be more robust against errors in the
raw matrix, and can make the subsequent SVM training more efficient.

In addition to constructing the refined kernel matrices directly, we can also construct refined
divergence matrices that lead to valid kernel matrices when converted to Gaussian kernels.
Again, this can be achieved by embedding the objects into some low-dimensional Euclidean space,
so that the distances between the points are close to the divergences between the objects.

The low-rank methods are also useful in dealing with missing data and accelerating the
computations. In low-rank methods, the degrees of freedom in the matrices are limited, so the
entries become redundant and the missing ones can be inferred from the observed ones. This
can also be used to speed up the computation of kernel matrices for collective data: instead
of computing the kernel between every pair of groups, we can skip some pairs and still get a
high-quality kernel matrix.

In the experiments, we examine the performance of the refined kernels constructed from the
raw kernel and the raw divergence matrices on image classification tasks. Depending on the data,
both may provide better results than the PSD projection method of Chapter 5. We also
test their effectiveness in the presence of missing entries in the kernel/divergence matrices, and
find that the kernel-matrix-based method can achieve good classification accuracies using a small
number of kernel evaluations.

The rest of this chapter is organized as follows. Related work is described in Section 6.2. In
Section 6.3 we describe how to complete a kernel matrix, and Section 6.4 shows how to complete
a divergence matrix. In Section 6.6 we examine the performance of the different approaches, and
finally we conclude in Section 6.7.


6.2  Related  Work 

The low-rank structure of kernel matrices has been frequently assumed and used in machine
learning. A number of papers have used low-rank approximations of the kernel matrix to accelerate
the training of SVMs. [51] described how to train SVMs efficiently when given low-rank kernel
matrices. [47] proposed and analyzed a sampling-based method that can construct low-rank
approximations efficiently. [87] proposed an alternative way to optimize low-rank approximations
efficiently. These methods operate on kernel matrices that are already valid, and their goal is to
accelerate the training of SVMs. We, on the other hand, are constructing kernels from noisy
non-PSD matrices. Nevertheless, common techniques can be used, and the low-rank matrices we
construct can also be used to train SVMs efficiently.

Other works learn the low-rank kernel matrix using heuristics or supervised information to
enhance the kernels. [169] learns a PSD matrix under neighborhood constraints
while maximizing the embedding’s variance. By doing this, the data can be “spread out” to
create clear visualizations. [7] incorporated class labels to increase the discriminative power of
the kernels. These enhancements can also be used for our purpose of constructing valid kernels.

The classical technique for creating Euclidean embeddings from given distances is metric
multidimensional scaling (MDS) [35]. MDS has been studied for decades, and many variations
have been created to deal with different problems. The method we use in this chapter is based on
metric MDS and specifically tuned for the divergences proposed in Chapter 5. The effectiveness
of different settings is also empirically evaluated.

Several papers have proposed methods to learn kernel matrices given missing data. [62] used
semi-definite programming (SDP) to learn the complete PSD matrix from a partially observed
kernel matrix, and [4] used an alternative constrained optimization to obtain sparse PSD matrices.
Neither method can handle noisy and non-PSD raw matrices, which are what we face in
this chapter. [1] described a way to ignore some entries in the kernel matrix, but their method
needs to know the full kernel matrix beforehand.


6.3  Constructing  Low-Rank  Kernels 

First we define our problem. Suppose that we have observed a kernel matrix
$K \in \mathbb{R}^{M \times M}$, where each entry $k_{ij} = k(G_i, G_j)$ is the kernel value for the
$(i,j)$th pair of groups. We mainly consider the kernels constructed from the estimated divergences
defined in Section 5.4, but the techniques below are applicable to general kernel matrices. This raw
kernel matrix $K$ might have been derived from non-metric divergences, and the kernel values might be
inaccurate due to estimation errors. As a result, $K$ may not be PSD and hence may not be a valid
Mercer kernel for use in kernel machines. Our goal is to find a refined kernel matrix
$\hat{K} \in \mathbb{R}^{M \times M}$ that is close to $K$ and is PSD.

By definition, the refined matrix $\hat{K}$ is PSD iff it can be decomposed as $\hat{K} = U^T U$
with $U \in \mathbb{R}^{l \times M}$, where $U$ is the factor matrix and $l$ is the rank of $\hat{K}$.
A brief introduction to matrix factorization can be found in Section 3.1.1. The columns of $U$,
denoted $\{u_i\}_{i=1,\dots,M}$, can be interpreted as points in an $l$-dimensional space, where the
points are the embeddings of the groups. The inner product between two points equals the kernel
value of their corresponding groups: $u_i^T u_j = \hat{k}(G_i, G_j)$.

A good $\hat{K}$ should be close to $K$, so we want to minimize the element-wise difference between
$\hat{K}$ and $K$. To cope with missing or unreliable entries, we further weigh the errors on different
entries by a weight matrix $W = \{w_{ij}\}_{i,j=1,\dots,M} \in \mathbb{R}^{M \times M}$. To reduce the degrees of
freedom, we constrain the rank $l$ of $\hat{K}$ and also minimize the norm of $U$. All these terms are
summarized by the following optimization problem, which we call the low-rank kernel construction
(LRKC) problem:

$$\hat{U} = \arg\min_{U \in \mathbb{R}^{l \times M}} \sum_{i,j} w_{ij} \left( u_i^\top u_j - k_{ij} \right)^2 + \lambda \|U\|_F^2 \qquad (6.1)$$

where $\|\cdot\|_F$ is the Frobenius norm, $\lambda$ is the penalty on the norm of $U$, and $\hat{U}$ is the factor matrix for
$\hat{K}$. The weight matrix $W$ controls the importance of the entries in $K$. Entries with zero weight are
ignored in the optimization and are thus regarded as missing. This problem can easily be solved
by local descent algorithms such as gradient descent and L-BFGS. Upon convergence, the refined
kernel matrix is obtained as $\hat{K} = \hat{U}^\top \hat{U}$.

We choose to minimize $\|U\|_F^2$ because it equals the nuclear norm of $\hat{K}$ (the sum of its singular
values), which is a good surrogate for its rank [138]. Therefore, we gain more control
over the complexity of $\hat{K}$ by penalizing $\|U\|_F^2$. Usually, we let the rank $l$ be relatively large and
control the complexity of $\hat{K}$ by varying $\lambda$.
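
As a rough illustration of problem (6.1), the NumPy sketch below minimizes the weighted objective by gradient descent with a simple backtracking step size, starting from an eigen-decomposition of the symmetrized raw matrix. The function names (`lrkc_objective`, `eig_init`, `lrkc`), the step-size rule, the iteration count, and the toy data are assumptions made for this example; it is not the implementation used in the experiments.

```python
import numpy as np

def lrkc_objective(U, K, W, lam):
    """Value and gradient of the objective in (6.1); U has shape l x M."""
    err = U.T @ U - K                       # element-wise errors, M x M
    R = W * err                             # weighted errors
    f = np.sum(R * err) + lam * np.sum(U ** 2)
    g = 2.0 * U @ (R + R.T) + 2.0 * lam * U
    return f, g

def eig_init(K, l):
    """Rank-l PSD approximation of the symmetrized K, returned as an l x M factor."""
    vals, vecs = np.linalg.eigh((K + K.T) / 2.0)
    top = np.argsort(vals)[::-1][:l]
    vals_top = np.clip(vals[top], 0.0, None)      # discard negative eigenvalues
    return (vecs[:, top] * np.sqrt(vals_top)).T

def lrkc(K, W, l, lam=0.0, n_iter=200):
    U = eig_init(K, l)
    f, g = lrkc_objective(U, K, W, lam)
    step = 1.0
    for _ in range(n_iter):
        # Backtracking: halve the step until the objective does not increase.
        while step > 1e-12:
            U_new = U - step * g
            f_new, g_new = lrkc_objective(U_new, K, W, lam)
            if f_new <= f:
                break
            step *= 0.5
        else:
            break                           # no acceptable step was found
        U, f, g = U_new, f_new, g_new
        step *= 2.0                         # allow the step to grow again
    return U

# Toy usage: a noisy, asymmetric, non-PSD raw kernel matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 30))
K_raw = X.T @ X + 0.5 * rng.normal(size=(30, 30))
W = np.ones((30, 30))
U_hat = lrkc(K_raw, W, l=3, lam=0.1)
K_hat = U_hat.T @ U_hat                     # refined kernel matrix, PSD by construction
```

Because the refined matrix is formed as a product of the factor with its transpose, it is PSD regardless of how well the optimization converges; the optimization only affects how close it is to the raw matrix.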

6.4  Constructing  Low-Rank  Divergences 

The refined kernel matrices are based on the raw kernel matrices, which are derived from divergences.
Therefore, once the kernel parameters (e.g., the width of the Gaussian kernel) change, we
have to refine the kernel matrix again. This problem can be solved by constructing low-rank distance
matrices instead. In addition, refined distances behave differently from refined kernels, and
may lead to better learning performance.

Formally, given a raw divergence matrix $D \in \mathbb{R}^{M \times M}$ for $M$ groups, we want to find a refined
distance matrix $\hat{D} \in \mathbb{R}^{M \times M}$ that leads to a PSD Gaussian kernel matrix. Lemma 1 below
reveals a way of satisfying this requirement: any distance function that leads to
PSD Gaussian kernels can be realized in some real Hilbert space. Another way of arriving at this
conclusion is the following. A distance $d$ is conditionally negative definite (CND) iff $e^{-\lambda d}$ is PSD for all $\lambda > 0$ [146],
and any non-negative symmetric CND matrix with a zero diagonal must be a squared Euclidean
distance matrix [145]. Therefore $e^{-\lambda d}$ is PSD for all $\lambda > 0$ iff $d$ is a squared Euclidean distance. In short, the
refined distance matrix $\hat{D}$ should be a Euclidean distance matrix.

Lemma 1 (Embeddability and PSD Functions [145]). A necessary and sufficient condition that
a separable space with a distance function that is non-negative, symmetric, and discernible be
isometrically embeddable in the real Hilbert space is that the family of functions $e^{-\lambda t^2}$, $\lambda > 0$, be
positive definite, where $t$ corresponds to the distances.

Meanwhile,  the  rank  of  the  Euclidean  distance  matrix  is  constrained  by  Lemma  2.  It  shows  that 
to  find  a  low-rank  Euclidean  distance  matrix  D  we  only  need  to  find  a  low-dimensional  embedding 
of  the  groups. 

Lemma  2  (Rank  of  the  Euclidean  Distance  Matrix  [46]).  Let  D  be  a  squared  Euclidean  distance 
matrix  for  l-dimensional  points.  Then  the  rank  of  D  is  at  most  l  +  2. 
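
The two lemmas can be checked numerically on a small example. The sketch below builds a squared Euclidean distance matrix from random low-dimensional points, counts its non-negligible singular values, and verifies that a Gaussian kernel built from it has no significantly negative eigenvalue. The point count, the kernel width `sigma`, and the kernel form $\exp(-D^2/\sigma^2)$ are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
l, M = 2, 20
X = rng.normal(size=(l, M))                    # columns are l-dimensional points

# Squared Euclidean distance matrix: D2[i, j] = ||x_i - x_j||^2.
sq_norms = np.sum(X ** 2, axis=0)
D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.T @ X

# Numerical rank: count the singular values that are not negligible.
s = np.linalg.svd(D2, compute_uv=False)
rank_D2 = int(np.sum(s > 1e-9 * s[0]))

# Gaussian kernel built from the squared distances (sigma is a hypothetical width).
sigma = 1.5
K = np.exp(-D2 / sigma ** 2)
min_eig = np.linalg.eigvalsh((K + K.T) / 2.0).min()

print("numerical rank of D2:", rank_D2, "(bound from Lemma 2:", l + 2, ")")
print("smallest eigenvalue of the Gaussian kernel:", min_eig)
```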

The above reasoning leads to a formulation for finding low-rank distance matrices that is similar
to the one for low-rank kernel matrices in the previous section. Concretely, we seek $U \in \mathbb{R}^{l \times M}$,
the low-dimensional embedding of the groups, via the following low-rank distance construction
(LRDC) problem:

$$\hat{U} = \arg\min_{U \in \mathbb{R}^{l \times M}} \sum_{i,j} w_{ij} \left( D_{ij}^r - \|u_i - u_j\|_2^r \right)^2 \qquad (6.2)$$

Similar to finding low-rank kernel matrices, the weight matrix $W = \{w_{ij}\}_{i,j=1,\dots,M}$ gives the
importance/presence of the entries, and $u_i$, the $i$th column of $U$, is the embedding of the $i$th group.
Again, this problem can be solved by gradient descent algorithms. Once we have the embedding
$\hat{U}$, we can calculate the refined distance matrix $\hat{D}$ that gives a PSD kernel matrix.

Note the presence of the power parameter $r$, which transforms the elements of the matrices before
computing their differences. This is important because Euclidean distances and the divergence
estimates may behave very differently (e.g., the KL divergence can easily go to infinity; in practice
we found that the estimators in Chapter 5 tend to underestimate large divergences), and it may
be difficult to match them directly even when using high-dimensional embeddings. Intuitively, a
small $r < 1$ emphasizes the approximation of small divergences, while a large $r > 1$ emphasizes
large divergences.

Problem  (6.2)  essentially  solves  the  metric  multidimensional  scaling  (MDS)  [35]  problem. 
Indeed, if $r = 1$ then (6.2) is MDS with the stress objective, and if $r = 2$ it is MDS with the
s-stress objective. It is interesting to note that we arrive at MDS from the initial motivation of
finding  a  low-rank  divergence  matrix  that  leads  to  PSD  Gaussian  kernels.  Compared  to  traditional 
MDS,  our  formulation  uses  the  parameter  r  to  accommodate  different  divergences. 
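
A minimal NumPy sketch of problem (6.2) is given below, assuming non-negative raw divergences, a random starting point, and gradient descent with backtracking. The names `lrdc_objective` and `lrdc`, the small constant `eps` that keeps the gradient finite when two embeddings coincide, and the toy data are assumptions of this example rather than details taken from the experiments.

```python
import numpy as np

def lrdc_objective(U, D, W, r, eps=1e-12):
    """Value and gradient of the objective in (6.2); U is l x M, D is non-negative."""
    diff = U[:, :, None] - U[:, None, :]          # diff[:, i, j] = u_i - u_j
    sq = np.sum(diff ** 2, axis=0) + eps          # squared embedding distances
    err = D ** r - sq ** (r / 2.0)                # transformed differences
    resid = W * err
    f = np.sum(resid * err)
    coef = -2.0 * r * (resid + resid.T) * sq ** (r / 2.0 - 1.0)
    np.fill_diagonal(coef, 0.0)
    g = np.einsum('ij,kij->ki', coef, diff)       # gradient with respect to each u_i
    return f, g

def lrdc(D, W, l, r=1.0, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(l, D.shape[0]))          # random start, for simplicity
    f, g = lrdc_objective(U, D, W, r)
    step = 1.0
    for _ in range(n_iter):
        # Backtracking: halve the step until the objective does not increase.
        while step > 1e-12:
            U_new = U - step * g
            f_new, g_new = lrdc_objective(U_new, D, W, r)
            if f_new <= f:
                break
            step *= 0.5
        else:
            break
        U, f, g = U_new, f_new, g_new
        step *= 2.0
    diff = U[:, :, None] - U[:, None, :]
    D_hat = np.sqrt(np.sum(diff ** 2, axis=0))    # refined Euclidean distance matrix
    return U, D_hat

# Toy usage: noisy, asymmetric distances between 2-D points.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 25))
D_true = np.sqrt(np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0))
D_raw = np.abs(D_true * (1.0 + 0.2 * rng.normal(size=D_true.shape)))
W = np.ones_like(D_raw)
U_hat, D_hat = lrdc(D_raw, W, l=2, r=0.7)
```

Since the refined matrix is computed from an explicit embedding, it is symmetric with a zero diagonal, and any Gaussian kernel built from it is PSD.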


6.5  Discussion 

Sections 6.3 and 6.4 described ways of finding low-rank kernel and distance matrices as refinements
of the raw kernel and divergence matrices. These refined matrices give us PSD kernel
matrices that can be used in various kernel machines.

Both problems (6.1) and (6.2) are non-convex, and local minima might exist. Therefore, finding
a good starting point is important. For the kernel matrix construction problem (6.1), note that when
all weights equal one ($W = \mathbf{1}$) and $\lambda = 0$, the global optimum can be obtained by either SVD or eigen-decomposition. In
this work we use this particular solution to initialize the optimization. For the distance matrix
construction problem (6.2), we use the solution of classical MDS, which also admits
a global optimum, as the initialization.
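
For reference, classical MDS can be sketched in a few lines: the squared distances are double-centered and the leading eigenvectors give the embedding. The function name `classical_mds` and the symmetrization of a possibly asymmetric input are assumptions of this sketch.

```python
import numpy as np

def classical_mds(D, l):
    """Classical MDS: embed M items in l dimensions from an M x M distance matrix."""
    M = D.shape[0]
    D = (D + D.T) / 2.0                           # symmetrize a possibly asymmetric input
    J = np.eye(M) - np.ones((M, M)) / M           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:l]
    vals_top = np.clip(vals[top], 0.0, None)      # guard against negative eigenvalues
    return (vecs[:, top] * np.sqrt(vals_top)).T   # l x M embedding

# With exact Euclidean distances the embedding reproduces them up to rounding.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 15))
D = np.sqrt(np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0))
U0 = classical_mds(D, l=2)
D0 = np.sqrt(np.sum((U0[:, :, None] - U0[:, None, :]) ** 2, axis=0))
```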

Missing entries can easily be handled by setting the corresponding entries in the weight matrix
$W$ to zero. By doing this, the missing entries are effectively ignored in the objective function
and the embedding $U$ is derived based only on the observed entries. When initializing the optimization
using eigen-decomposition or classical MDS, we set the missing entries to the average
kernel/divergence value.
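
The handling of missing entries described above can be sketched as follows. The helper names `random_weight_mask` and `impute_mean` are hypothetical; the mask keeps the diagonal observed, and the mean imputation is intended only for the initialization step.

```python
import numpy as np

def random_weight_mask(M, p, rng):
    """0/1 weight matrix that keeps roughly a fraction p of the entries; the diagonal is always kept."""
    W = (rng.random((M, M)) < p).astype(float)
    np.fill_diagonal(W, 1.0)
    return W

def impute_mean(A, W):
    """Replace zero-weight entries of A by the mean of the observed entries."""
    observed = W > 0
    filled = A.copy()
    filled[~observed] = A[observed].mean()
    return filled

rng = np.random.default_rng(0)
A = rng.random((50, 50))                 # stand-in for a raw kernel or divergence matrix
W = random_weight_mask(50, p=0.5, rng=rng)
A_init = impute_mean(A, W)               # input for eigen-decomposition or classical MDS
```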

The ability to cope with missing entries provides us with a way to speed up the construction of
the kernel matrix. Even though the divergence computation only needs KNN statistics, comparing
two groups is still slow relative to comparing two vectors. Instead of computing every entry in the
divergence matrix, we can intentionally skip some of them and let the low-rank construction infer
them. This is possible because the low-rank structure greatly reduces the degrees of freedom in the
divergence/kernel matrices, hence the information carried by the entries becomes redundant and we
can impute the missing entries. The nature of this imputation is the same as that of Chapter 2.

We can use different ways to determine a suitable rank $l$ that balances the approximation
accuracy against the ability to filter out the noise and infer the missing entries. For the kernel matrix,
we can directly use the number of dominant eigenvalues to guess the rank as in PCA, while using $\lambda$
to further control the model complexity. As for the divergence matrix, according to Lemma 2, we
can look at the singular values of the element-wise squared matrix $D^2$ to determine a sensible rank. The above indications can be
used as guidelines to guess the rank of the refined matrices, but in general cross-validation should
be used to determine the best choice of parameters.
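
One simple way to turn the spectrum into a rank guess is to keep the smallest number of leading singular values that account for a given fraction of their sum. The function name `rank_for_energy` and the use of the plain (unsquared) singular values as the measure of preserved variance are assumptions of this sketch.

```python
import numpy as np

def rank_for_energy(A, fraction=0.95):
    """Smallest rank whose leading singular values account for `fraction` of their total."""
    s = np.linalg.svd(A, compute_uv=False)        # sorted in descending order
    cumulative = np.cumsum(s) / np.sum(s)
    rank = int(np.searchsorted(cumulative, fraction)) + 1
    return min(rank, len(s))

A = np.diag([6.0, 3.0, 1.0])                      # spectrum 6, 3, 1: cumulative shares 0.6, 0.9, 1.0
ranks = [rank_for_energy(A, q) for q in (0.5, 0.8, 0.95)]
```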


6.6  Experiments 

In this section, we evaluate the empirical performance of the refined kernel matrices constructed
by the low-rank methods. First we use synthetic data sets to demonstrate them, and then
test their effectiveness in image classification tasks.

6.6.1  Synthetic  Data 

We synthesize a toy data set using the following steps. We randomly choose $M = 100$ points in
the 2D space, half from the Gaussian distribution $\mathcal{N}(-3, 1)$ and half from $\mathcal{N}(3, 1)$, and use them
to calculate the ground-truth distance matrix $D^*$ and the ground-truth Gaussian kernel matrix $K^*$ computed from $D^*$, where $\sigma$ is
the kernel width. Then we impose independent Gaussian noise on the entries of $D^*$ to get the
noisy raw distance matrix $D$, i.e. $D_{ij} = D^*_{ij} + \epsilon_{ij}$, where $\epsilon_{ij}$ is zero-mean Gaussian noise whose scale grows with $D^*_{ij}$ and $\rho$ controls the noise level. We
let the noise level increase as the distance becomes larger in order to emulate the behavior of the
divergence estimators. The raw kernel matrix $K$ is then the Gaussian kernel computed from $D$ in the same way. We use $\rho = 0.3$ and $\sigma = 2$.
Note that the noise level is very high here, and the noise can make $D$ asymmetric.

Our goal is to obtain a refined kernel matrix $\hat{K}$ from the noisy observations $D$ and $K$, so that $\hat{K}$
is close to the ground truth $K^*$. We do this in two ways: 1) get a refined distance matrix $\hat{D}$ from
$D$ using LRDC (6.2), and then derive $\hat{K}$ from $\hat{D}$; 2) directly get a refined $\hat{K}$ from $K$ using LRKC
(6.1). For LRDC, we set the rank $l = 2$. For LRKC, we use the rank that preserves 95% of the data
variance according to SVD (usually $l = 9$), and set the penalty $\lambda = 0$. All the weights are set to one.
Different settings of the divergence transformation $r$ are tested. To measure the quality of recovery,
we compute the element-wise correlation between $\hat{K}$ and $K^*$, since scaling the kernel matrix will
not affect learning.
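
The toy setup and the recovery measure can be sketched as below. Several details are assumptions made for this illustration and are not taken from the text: the cluster means are placed at -3 and +3 in both coordinates, the noise standard deviation is taken to be $\rho D^*_{ij}$, and the Gaussian kernel is taken to be $\exp(-D^2/\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, rho, sigma = 100, 0.3, 2.0

# Two unit-variance 2-D Gaussian clusters (assumed centers: -3 and +3 in each coordinate).
X = np.hstack([rng.normal(-3.0, 1.0, size=(2, M // 2)),
               rng.normal(3.0, 1.0, size=(2, M // 2))])

D_star = np.sqrt(np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0))
K_star = np.exp(-D_star ** 2 / sigma ** 2)        # assumed Gaussian kernel form

# Assumed noise model: independent Gaussian noise with standard deviation rho * D_star.
D_raw = D_star + rho * D_star * rng.normal(size=D_star.shape)
K_raw = np.exp(-D_raw ** 2 / sigma ** 2)

def elementwise_correlation(A, B):
    """Pearson correlation between the entries of two matrices of equal shape."""
    return float(np.corrcoef(A.ravel(), B.ravel())[0, 1])

corr_D = elementwise_correlation(D_raw, D_star)
corr_K = elementwise_correlation(K_raw, K_star)
```

The correlation is unchanged when either matrix is multiplied by a positive constant, which is why it is a convenient measure when the overall scale of the kernel does not matter.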

The results of 20 random runs are shown in Figure 6.1. We can see that the quality of the
recoveries is very high, with correlations above 0.99. Examples of the matrices are shown in
Figure 6.2. For LRDC, we include the results of different $r$ values to show the importance of
transforming the divergences. Indeed, an $r$ that is too small or too large produces bad results. The
optimal range for $r$ seems to be within $[0.2, 1]$, with a weak peak at $r = 0.7$. We can also see that
LRDC can outperform LRKC given proper $r$ values.

Next, we test how the methods perform given missing entries. We randomly pick a portion $p$ of
entries (diagonal entries are always picked to help bound the LRKC problem) in the weight matrix
and set the rest to zero. Then we do the same test as above. Figures 6.3a and 6.3b show the results
of picking $p = 50\%$ and $p = 20\%$ of the entries. We can see that both methods still produce satisfactory
results even if a majority of the data are missing. We observe that in this test the LRDC results are
significantly better than the LRKC results. The reason might be that Gaussian kernel matrices
are inherently full rank, and in the presence of missing entries it becomes more difficult to guess a
good rank for LRKC. On the other hand, we know that the optimal rank for the distance matrix is
2. It should also be noted that better results might be available if we tune the $\lambda$ parameter for LRKC.
The optimal $r$ values for these two cases are 0.5 and 0.4 respectively, showing very consistent
behavior.

Figure 6.1: Recovery performances on the toy data set. LRDC-D and LRDC-K are for the distance
matrix and the kernel matrix obtained by LRDC respectively. LRKC-K is for the kernel matrix
obtained by LRKC. The X axis shows different values of the parameter $r$ in LRDC. The correlations
of the raw distance matrix $D$ and kernel matrix $K$ are also shown.


6.6.2  Image  Classification 

In  this  section,  we  test  the  refined  kernels’  performances  in  real-world  image  classification  tasks. 
The  data  and  the  setup  of  experiments  here  are  the  same  as  in  Section  5.6,  therefore  we  omit  the 
details  here.  Instead  of  using  the  PSD  projection  method  to  get  the  valid  PSD  kernels  as  in  Section 
5.5,  we  use  LRDC  and  LRKC  to  accomplish  the  same  task. 

Since the number of possible settings is too large (e.g., divergence types, divergence estimators,
kernel width, rank of the refined matrices, and so on), we use heuristics and limit our
attention to the most interesting ones. We only use the raw estimated KL divergences and the resulting
Gaussian kernels given in Chapter 5, since they are the most common choice and often lead to
near-optimal results. For LRKC, we avoid using cross-validation to select the kernel width, which
is computationally demanding, by heuristically setting the kernel width to $2\sigma^0$, where $\sigma^0$ is the
average divergence from a group to its 3rd nearest neighbor group. In these data sets, finding a
good guess of the rank $l$ becomes increasingly difficult because the spectra of the matrices become
highly concentrated. Instead, we found that $l = 100$ works well for smaller data sets with fewer
than $M = 1{,}000$ groups, and $l = 150$ works well for larger data sets. To generate incomplete
divergence/kernel matrices, we randomly mark 50% of their entries as missing and set the
corresponding weights to zero.
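
The kernel-width heuristic can be written compactly. In the sketch below, the function name `kernel_width_heuristic` is hypothetical, the divergence matrix is assumed to be fully observed, and self-comparisons on the diagonal are excluded before the neighbors are ranked.

```python
import numpy as np

def kernel_width_heuristic(D, k=3, factor=2.0):
    """factor times the average divergence from each group to its k-th nearest neighbor group."""
    M = D.shape[0]
    off_diag = D[~np.eye(M, dtype=bool)].reshape(M, M - 1)   # drop self-comparisons
    kth = np.sort(off_diag, axis=1)[:, k - 1]                # k-th smallest value in each row
    return factor * float(np.mean(kth))

# Toy check: six groups placed at 0, 1, ..., 5 on a line, with |i - j| as the divergence.
pos = np.arange(6.0)
D = np.abs(pos[:, None] - pos[None, :])
width = kernel_width_heuristic(D)        # 3rd-neighbor divergences are 3, 2, 2, 2, 2, 3
```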

In each run, we use half of the groups for training and the other half for testing. SVM parameters
(the slack penalty $C$ and the kernel width $\sigma$ when the input is a distance matrix) are tuned by
3-fold cross-validation on the training set. We report the results of LRKC with different values of $\lambda$ and
LRDC with different values of $r$. Accuracies from 10 random runs are reported. In the figures, the method
label “D” denotes the results from the PSD projection method, which is used as our baseline, and
method labels with a “-I” postfix denote incomplete data.

Figure 6.2: Example results from LRKC and LRDC. $K^*$ and $D^*$ are the ground truths. $K$ and $D$
are the noisy observations. $\hat{K}$ and $\hat{D}$ are the low-rank results produced by LRKC and LRDC.


ETH-80  First, we report the results on the ETH-80 [95] object recognition data set. The results
with both full and incomplete divergences/kernels are shown in Figure 6.4. We can see that LRKC
causes a slight degradation in accuracy, while LRDC is able to slightly outperform the baseline
using far fewer degrees of freedom. On the other hand, LRKC is more robust against missing
entries, showing a decrease of only about 2% in accuracy. The sensitivities to the parameters
are generally small. LRKC prefers a smaller $\lambda$, while the performance of LRDC peaks slightly
around $r = 0.8$.


Sports  Results on the Sports scene [100] data set are shown in Figure 6.5. On this more complex
data set, the low-rank methods LRKC and LRDC can both outperform the baseline. The optimal
performance of LRDC is achieved with around $r = 0.5$ or $r = 2$. Notably, the incomplete LRKC
method achieved the same performance as the baseline method using only half of the entries,
and is again only 2% worse than itself with full data. The incomplete LRDC, however, failed to
maintain its effectiveness when facing missing entries. Observe that on this data set the performance of
the incomplete LRKC is sensitive to the parameter $\lambda$, meaning that its ability to cope with missing
entries depends on a proper model complexity.

Figure 6.3: Recovery performances on the toy data set with missing data. LRDC-D and LRDC-K
are for the distance matrix and the kernel matrix obtained by LRDC respectively. LRKC-K is
for the kernel matrix obtained by LRKC. The X axis shows different values of the parameter $r$ in
LRDC. The correlations of the raw distance matrix $D$ and kernel matrix $K$ are also shown.


OT  Finally,  results  on  the  OT  [126]  data  set,  which  contains  natural  scene  images,  are  shown 
in  Figure  6.6.  Again,  we  see  that  LRDC  is  able  to  significantly  outperform  the  baseline,  and 
incomplete  LRKC  achieved  high  accuracies  that  are  less  than  1%  worse  than  the  baseline.  The 
optimal  performance  of  LRDC  is  achieved  with  around  r  =  0.5  and  r  =  2. 

Based on the above results, we conclude that LRDC can achieve better performance than LRKC
and the PSD projection method on the full matrices given suitable parameters. On the other
hand, the incomplete LRKC is robust against missing values, usually being able to save half of
the computation while only losing 2% accuracy. LRKC usually performs similarly to the PSD
projection on full matrices, and LRDC can have difficulties handling missing data.


6.7  Summary 

In this chapter we investigated more ways of constructing Mercer kernels based on the kernel
estimators in Chapter 5. By exploiting the low-rank structures of the raw kernel and divergence matrices,
we are able to derive better kernels for kernel machines.

Both kernel-based and divergence-based approaches are proposed based on the low-dimensional
embedding of the groups. The performance varies depending on the specific data set, but in general
both can produce better results than the PSD projection method in Chapter 5, thanks to their
robustness against errors and the discrepancy between the PSD kernels and the divergences.

We also tested the performance of using the methods to speed up the learning by skipping
divergence computations. We found that the kernel-based approach is very robust against missing
kernel values, so that we can skip a large portion of the kernel computations while preserving
the learning performance. Meanwhile, the divergence-based approach is less effective in this
case. Another interesting direction to explore is to borrow the ideas of active learning
and purposefully choose which kernel values to observe, so that we can recover a good low-rank
kernel matrix using as few observations as possible.

Figure 6.4: Classification accuracy on the ETH-80 data set. D is the baseline (green dashed line)
using PSD projection. The “-I” postfix means incomplete data. LRKC with different values of $\lambda$ and
LRDC with different values of $r$ are shown.

Figure 6.5: Classification accuracy on the Sports data set.

Figure 6.6: Classification accuracy on the OT data set. D is the baseline (green dashed line) using
PSD projection. The “-I” postfix means incomplete data. LRKC with different values of $\lambda$ and LRDC with
different values of $r$ are shown.

Chapter 7

Accelerated  Learning  by  Condensing 


In addition to the methods proposed in this thesis, several other algorithms have recently been proposed
to learn from data that are represented as sets/groups of vectorial points. Such algorithms usually
suffer from a high demand for computational resources, making them impractical for large-scale
problems. We propose to solve this problem by condensing, i.e., reducing the sizes of the sets
while maintaining the learning performance. Three methods are examined and evaluated with a
wide spectrum of set learning algorithms on several large-scale image data sets. We discover that
k-Means can successfully achieve the goal of condensing. In many cases, k-Means condensing
can improve the algorithms’ speed, space requirements, and, surprisingly, learning performance
simultaneously.


7.1  Introduction 

In  many  problems  the  object  of  interest  can  be  represented  by  a  set  of  multidimensional  vectors. 
For  instance,  in  computer  vision  an  image  is  often  treated  as  a  set  of  patches  [50].  In  text  processing 
and  retrieval,  we  can  also  think  of  a  document  as  a  set  of  sections/paragraphs  to  cope  with  its 
structure.  A  convenient  and  indeed  frequently  used  way  to  deal  with  point  sets  is  to  discretize 
the  points  and  construct  a  feature  vector  for  each  set.  However,  the  conversion  process  is  often 
problem-specific,  sometimes  complicated,  and  involves  much  human  effort.  More  importantly, 
this  set-to-vector  reduction  can  cause  loss  of  information. 

On  the  other  hand,  the  development  of  algorithms  that  handle  sets  directly  has  been  largely  left 
behind.  One  major  disadvantage  of  such  algorithms  is  their  high  computational  cost  compared  to 
those  that  operate  on  vectors.  Despite  the  difficulties,  recently  several  methods  have  been  proposed 
to  deal  with  point  sets  directly.  They  learn  from  the  sets  without  the  set-to-vector  reduction,  so  that 
the  researchers  do  not  have  to  design  the  feature  vector  for  a  set,  and  the  loss  of  information  caused 
by the reduction can be avoided. For example, Chapter 5 designs a novel kernel between point
sets based on consistent estimators of divergences between distributions, and achieves
state-of-the-art classification performance on a couple of data sets. [17] proposed an extremely simple
classifier for point sets based on group-to-class matching, and showed that it could compete with
classifiers based on very sophisticated set features on images. These successes demonstrate the
advantage of learning directly from point sets over the reduction approach.

Early set learning algorithms (more specifically, set similarities), such as the Hausdorff distance
and the mean map kernels of [65], rely on the similarities between every pair of points and are thus
computationally expensive. Recent improvements such as Chapter 5 and [17, 112, 131] gained
efficiency by designing algorithms based on information from the points’ local neighborhoods,
which can be obtained via efficient search algorithms. [17, 112] proposed a new classification
paradigm that compares images to classes and significantly accelerates the prediction. Details
of these methods are described in Section 7.2. Nevertheless, they still demand much time and
storage space, making them unsuitable for large-scale problems and less likely to be adopted by
practitioners.

In  this  work,  we  aim  to  further  improve  the  computational  efficiency  of  set  learning  algorithms. 
In  most  set  learning  algorithms,  the  cost  to  train,  store,  and  apply  the  model  is  determined  by  the 
sizes/cardinalities  of  the  sets.  Therefore,  our  approach  is  to  directly  attack  the  crux  of  the  problem 
by  reducing  the  size  of  sets  while  maintaining  the  learning  performance  in  an  unsupervised  way. 
We  call  such  an  operation  condensing. 

To achieve this goal, we analyze and evaluate three possible ways to decrease the size of a
set: random sampling, uniform covering, and distribution approximation using k-Means. These
three methods are chosen because they are easy to implement and efficient to run in large data
scenarios. Our discovery is that distribution approximation via k-Means is the only method that
can successfully achieve the goal of condensing.
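
To make two of these options concrete, the sketch below condenses a point set either by random sampling or by replacing it with k-Means centers computed with Lloyd's algorithm. The function names, the fixed number of iterations, and the quantization-error measure are assumptions of this example; uniform covering is not shown.

```python
import numpy as np

def condense_random(G, k, rng):
    """Random sampling: keep k points of the set G (shape N x D), chosen without replacement."""
    return G[rng.choice(G.shape[0], size=k, replace=False)]

def condense_kmeans(G, k, rng, n_iter=20):
    """Lloyd's k-Means: replace the N points of G by k cluster centers."""
    centers = condense_random(G, k, rng).copy()   # initialize with a random subset
    for _ in range(n_iter):
        d2 = np.sum((G[:, None, :] - centers[None, :, :]) ** 2, axis=2)   # N x k
        labels = np.argmin(d2, axis=1)
        for c in range(k):
            members = G[labels == c]
            if len(members) > 0:                  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return centers

def quantization_error(G, centers):
    """Mean squared distance from each point of G to its closest center."""
    d2 = np.sum((G[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return float(np.mean(np.min(d2, axis=1)))

rng = np.random.default_rng(0)
G = rng.normal(size=(500, 3))                     # one set with 500 three-dimensional points
sample = condense_random(G, 10, np.random.default_rng(1))
centers = condense_kmeans(G, 10, np.random.default_rng(1))   # same initial subset as `sample`
```

Each set is condensed independently of the others, so the step is straightforward to run in parallel over sets.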

In our experiments, we apply k-Means condensing as a pre-processing step to various
point-set learning methods on several image classification tasks, and find that the performance is
surprisingly good and consistent. In most problems, we do not have to make a speed-accuracy
tradeoff; condensing can actually improve both speed and accuracy simultaneously. In addition,
this condensing step can be easily implemented and parallelized for large-scale problems. We
believe this discovery is useful to practitioners who have large-scale point-set data.

The  rest  of  this  chapter  is  organized  as  follows.  In  Section  7.2  we  introduce  the  notation  and 
common  learning  algorithms  on  point  sets  to  provide  a  context  to  this  study.  Section  7.3  briefly 
reviews  related  work.  Section  7.4  describes  in  detail  the  condensing  methods  we  are  examining. 
In  Section  7.5,  we  thoroughly  evaluate  the  performance  of  different  methods  on  different  data  sets 
and  discuss  our  findings.  Finally,  we  discuss  and  conclude  this  chapter  in  Section  7.6  and  Section 
7.7. 

7.2  Background 

Most  learning  tasks  can  easily  be  accomplished  if  we  know  the  similarities  between  the  point  sets. 
Many  set  learning  algorithms  assume  that  a  set  has  an  unknown  underlying  distribution,  and  the 
points  in  the  set  are  i.i.d.  samples  from  that  distribution.  Then  the  similarity  measures  between 
point  sets  can  be  designed  based  on  the  divergences  between  their  underlying  distributions.  For 
example,  [65,  155]  proposed  the  mean  map  set  kernel  to  test  if  two  point  sets  have  the  same 
underlying  distribution.  The  same  technique  has  been  used  by  multiple-instance  learning  [55]  and 
computer vision [109]. [17] uses a simplified kernel density estimator to estimate the divergence
between a set and the classes, and assigns the set to the class with the most similar distribution.
[130, 131] and Chapter 5 use a consistent nonparametric estimator to get the divergences between
the point sets and use these dissimilarities to construct Gaussian kernels so that SVMs can be used
for classification.

From the computational perspective, many of the set similarity measures can be considered
as aggregations of the similarities between the individual points of the sets. We discuss in
more detail how these similarities are measured and aggregated in Section 7.2.1. The key point
is that these point-level pairwise comparisons make the speed of the algorithms crucially depend
on the sizes of the sets. This is why reducing the sets’ sizes by condensing would greatly improve
their computational efficiency. Generative methods such as [173, 174] and Chapter 4 have also
been developed to model point sets. Condensing the sets will also benefit these methods. In this
chapter we focus on the similarity-based approaches and their applications in set classification
problems.

We again consider a data set with $M$ point sets $\{G_m\}_{m=1,\dots,M}$, $G_m = \{x_{mn}\}_{n=1,\dots,N_m}$, $x_{mn} \in \mathbb{R}^D$.
We also assume that each $G_m$ has an unknown underlying distribution $f_m$, and the points
$\{x_{mn}\}$ are i.i.d. samples from $f_m$. For instance, in the context of image classification, each $G_m$ is
an image, and the vector $x_{mn}$ is the feature of the $n$th patch in this image.

Nearest neighbors (NN) are frequently used in set learning algorithms. We use $\mathrm{NN}_G(x)$ to
denote the NN of $x$ in point set $G$. If $x$ is in $G$ then it excludes itself during the search. Ties, if any,
are broken arbitrarily.

7.2.1  Set  Similarities 

Set Kernels  [65, 155] proposed the following kernel (similarity) for two sets of points $G_1$ and
$G_2$:

$$K(G_1, G_2) = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} k(x_{1i}, x_{2j}) \qquad (7.1)$$

where $k(x, y)$ is a kernel between points $x$ and $y$. One particularly popular example is the Gaussian
kernel $k(x, y) = \exp(-\|x - y\|_2^2 / \sigma^2)$. The underlying principle for this kernel is that if the
point-level kernel $k(x, \cdot)$ induces a feature map $\phi(x)$ for $x$ in a reproducing kernel Hilbert space
(RKHS), then the corresponding set-level kernel $K(G, \cdot)$ induces the feature map
$\Phi(G) = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n)$ for $G$ in the same RKHS. Since $\Phi(G)$ is the empirical mean of the mapped features
of the points, $K$ is called the mean map kernel (MMK).

We  can  see  that  MMK  essentially  is  a  way  of  aggregating  the  point-level  similarities  between 
two  sets.  It  possesses  many  nice  theoretical  properties  such  as  positive  definiteness  [65,  120].  The 
same  idea  has  also  been  used  in  computer  vision  [109]  and  multiple-instance  learning  [55].  Yet 
since  it  averages  the  similarities  between  every  pair  of  points,  MMK  will  slow  down  quadratically 
w.r.t.  the  sets’  sizes,  and  become  infeasible  for  even  moderately  sized  problems. 
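
A direct implementation of (7.1) makes the quadratic cost visible: the full table of point-level kernel values is formed before averaging. In this sketch the function names are hypothetical and the Gaussian kernel is assumed to have the form $\exp(-\|x - y\|_2^2 / \sigma^2)$.

```python
import numpy as np

def gaussian_gram(G1, G2, sigma):
    """Point-level Gaussian kernel values for all pairs of rows of G1 and G2 (N1 x N2)."""
    d2 = np.sum((G1[:, None, :] - G2[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / sigma ** 2)

def mean_map_kernel(G1, G2, sigma=1.0):
    """Mean map kernel (7.1): the average of all N1 * N2 point-level kernel values."""
    return float(np.mean(gaussian_gram(G1, G2, sigma)))

# Set-level Gram matrix for three small sets of different sizes.
rng = np.random.default_rng(0)
sets = [rng.normal(size=(n, 2)) for n in (5, 8, 6)]
K = np.array([[mean_map_kernel(a, b) for b in sets] for a in sets])
```

The cost of one set-level kernel value is proportional to $N_1 N_2$, so halving the size of every set reduces the cost of the whole Gram matrix by roughly a factor of four.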

[167] used a variation of MMK that only considers the most similar pairs of points:

$$K(G_1, G_2) = \frac{1}{N_1} \sum_{i=1}^{N_1} k(x_{1i}, \mathrm{NN}_{G_2}(x_{1i})) + \frac{1}{N_2} \sum_{j=1}^{N_2} k(x_{2j}, \mathrm{NN}_{G_1}(x_{2j})). \tag{7.2}$$

Computationally it improves on MMK by only using the points' NNs instead of dealing with all
pairs. Unfortunately, this kernel is no longer a proper Mercer kernel, but it still serves well as a
similarity measure. Other related methods include [63], which uses multi-resolution histograms to
approximate MMK, and [16], which constructs explicit approximate feature maps for the MMK
so that linear classifiers can be used to achieve faster computation.
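The NN-based similarity (7.2) can be sketched in the same brute-force style. This is an illustration under assumptions: the point-level kernel is taken to be Gaussian, the helper names are hypothetical, and the two sets are assumed to be distinct, so no self-exclusion is applied during the search.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest_neighbor(x, G):
    """Brute-force NN of x in G; ties are broken by order of appearance."""
    return min(G, key=lambda y: sq_dist(x, y))

def nn_set_kernel(G1, G2, sigma=1.0):
    """NN-based set similarity (7.2) with a Gaussian point-level kernel."""
    def k(x, y):
        return math.exp(-sq_dist(x, y) / sigma ** 2)
    term1 = sum(k(x, nearest_neighbor(x, G2)) for x in G1) / len(G1)
    term2 = sum(k(y, nearest_neighbor(y, G1)) for y in G2) / len(G2)
    return term1 + term2
```

The brute-force search used here still touches every pair of points; the computational saving only materializes when `nearest_neighbor` is backed by a search structure such as a KD-tree.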


Set Divergences  Another class of dissimilarity measures is defined based on the statistical
divergences between two distributions. In [130, 131, 168], the authors provided consistent NN-based
estimators for various divergences including the Kullback-Leibler (KL), Rényi, and $L_2$
divergences. They have been successfully applied to image classification problems when used in SVMs;
see Chapter 5.

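For example, the KL divergence between two point sets can be estimated by [168]

$$\mathrm{KL}(G_1 \| G_2) = \frac{D}{N_1} \sum_{i=1}^{N_1} \ln \frac{\|x_{1i} - \mathrm{NN}_{G_2}(x_{1i})\|_2}{\|x_{1i} - \mathrm{NN}_{G_1}(x_{1i})\|_2} + \ln \frac{N_2}{N_1 - 1} \tag{7.3}$$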


where $D$ is the dimensionality of $x_{1i}$. It was proved that these estimators are consistent, i.e.
under regularity conditions, (7.3) converges to $\mathrm{KL}(f_1 \| f_2)$, the divergence between the
densities underlying $G_1$ and $G_2$, as the sample sizes $N_1$ and $N_2$ approach infinity.
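A brute-force sketch of the NN-based KL estimator may help make the roles of the two searches concrete. It assumes the standard form of the estimator in (7.3) (log ratio of cross-set to within-set NN distances plus a sample-size correction), that all points are distinct so no distance is zero, and that $G_1$ has at least two points. The function names are hypothetical.

```python
import math

def nn_dist(x, G, exclude_index=None):
    """Euclidean distance from x to its NN in G, optionally skipping one index."""
    best = math.inf
    for j, y in enumerate(G):
        if j == exclude_index:
            continue
        best = min(best, math.dist(x, y))
    return best

def kl_nn_estimate(G1, G2):
    """NN-based estimate of KL(G1 || G2) following (7.3)."""
    n1, n2 = len(G1), len(G2)
    dim = len(G1[0])
    log_ratio_sum = 0.0
    for i, x in enumerate(G1):
        within = nn_dist(x, G1, exclude_index=i)  # x excludes itself in G1
        across = nn_dist(x, G2)
        log_ratio_sum += math.log(across / within)
    return dim / n1 * log_ratio_sum + math.log(n2 / (n1 - 1))
```

Note that the estimate can be negative on finite samples, even though the true KL divergence is non-negative.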

[17] proposed an alternative estimate of the KL divergence. Consider kernel density estimation
at point $x_{1i}$ given the points in $G_2$ with a Gaussian kernel of width $\sigma$:

$$\hat{f}(x_{1i}; G_2) \propto \frac{1}{\sigma N_2} \sum_{j=1}^{N_2} \exp\left(-\frac{\|x_{1i} - x_{2j}\|_2^2}{\sigma^2}\right) \tag{7.4}$$


This estimator is computationally demanding since we have to consider every pair of points. In
[17], the authors let the width $\sigma$ be small enough that the summation is dominated by its
largest term, i.e. the nearest neighbor, and the estimator becomes

$$\ln \hat{f}(x_{1i}; G_2) \approx -\|x_{1i} - \mathrm{NN}_{G_2}(x_{1i})\|_2^2 / \sigma^2 - \ln N_2 \sigma + \text{const}, \tag{7.5}$$

which again is based on NNs. The corresponding estimated KL divergence is

$$\mathrm{KL}(G_1 \| G_2) \propto \sum_{i=1}^{N_1} \|x_{1i} - \mathrm{NN}_{G_2}(x_{1i})\|_2^2 - \sum_{i=1}^{N_1} \|x_{1i} - \mathrm{NN}_{G_1}(x_{1i})\|_2^2 \tag{7.6}$$

up to a constant. Note the resemblance between (7.6) and (7.3). Unlike (7.3), this estimator is not
consistent even with infinitely many points, but in practice it can still produce good results.
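The step from the full kernel density estimate to its nearest-neighbor approximation can be checked numerically. The sketch below is an illustration, not the implementation of [17]: it evaluates the log of the kernel sum (up to the same additive constant in both functions) and the single-term approximation, with hypothetical function names.

```python
import math

def _sq_dists(x, G2):
    return [sum((a - b) ** 2 for a, b in zip(x, y)) for y in G2]

def log_kde(x, G2, sigma):
    """Log of the Gaussian kernel sum with the 1 / (sigma * N2) factor,
    computed stably with the log-sum-exp trick."""
    exps = [-d2 / sigma ** 2 for d2 in _sq_dists(x, G2)]
    m = max(exps)
    return m + math.log(sum(math.exp(e - m) for e in exps)) - math.log(len(G2) * sigma)

def log_kde_nn(x, G2, sigma):
    """NN approximation: keep only the largest term of the sum."""
    return -min(_sq_dists(x, G2)) / sigma ** 2 - math.log(len(G2) * sigma)

x = [0.0]
G2 = [[1.0], [2.0], [3.0]]
gap_small_sigma = log_kde(x, G2, 0.5) - log_kde_nn(x, G2, 0.5)
gap_large_sigma = log_kde(x, G2, 5.0) - log_kde_nn(x, G2, 5.0)
```

With a small width the two values nearly coincide, while with a large width the dropped terms matter, which is why the approximation relies on a small kernel width.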

These  set  similarities  follow  a  similar  pattern.  They  generally  use  nearest  neighbor  statistics 
and  can  be  efficient  in  low  dimensions  where  various  search  trees  work  well.  However,  in  high 
dimensions  (as  few  as  10)  the  computational  speed  of  the  estimators  will  deteriorate  rapidly. 

7.2.2  Set  Classification  Schemes


Having the similarities between the sets, we can easily apply SVM, KNN, or other techniques to
accomplish tasks like classification, ranking, and clustering as in e.g. [120, 130, 131]. For example,
the KL divergences estimated using (7.3) can be used to construct Gaussian kernels, e.g.

$$K_{\mathrm{KL}}(G_1 \| G_2) = \exp\left(-\mathrm{KL}(G_1 \| G_2) / \sigma^2\right), \tag{7.7}$$

where $\sigma$ is the width of the kernel. Due to the properties of the KL divergence, this kernel is
neither symmetric nor positive definite. Therefore, Chapter 5 proposed to approximate this “pseudo”
kernel matrix by the closest positive definite matrix, and then use this approximation as the input
to SVMs.

The  drawback  of  this  set-vs-set  approach  is  that  the  training  cost  grows  quadratically  and  the 
prediction  cost  grows  linearly  with  the  number  of  sets  for  training.  Considering  that  comparing 
a  pair  of  sets  already  requires  significant  work,  this  scheme  quickly  becomes  infeasible  in  larger 
problems.  To  solve  this  problem,  [17]  proposed  a  set-vs-class  paradigm  for  set  classification. 
Assume that there are $C$ classes indexed by $c$, and class $c$ is represented by the merged set

$$H_c = \bigcup_{G_m \in c} G_m,$$

which contains all the points in the sets that belong to class $c$. The classification rule is to
assign a test set $G$ to the class with the smallest divergence

$$c(G) = \arg\min_c \mathrm{KL}(G \| H_c), \tag{7.8}$$

where the KL divergence is estimated by (7.6). The assumption under this scheme is that all the
sets in the same class $c$ have the same distribution $f_c$, from which the points in $H_c$ are
i.i.d. samples. Though it is debatable whether this assumption is valid in real-world problems, the
resulting algorithm, called the Naive Bayes nearest neighbor (NBNN) [17], is extremely simple and
performs well empirically.
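The set-vs-class rule can be sketched as follows. Since the second sum in (7.6) depends only on the test set $G$, it is the same for every class and can be dropped from the argmin in (7.8), leaving the sum of squared NN distances to each merged class set. The brute-force search, the function names, and the dictionary layout of `class_sets` are assumptions made for illustration; a practical implementation would use a search tree.

```python
def sq_nn_dist(x, H):
    """Squared Euclidean distance from x to its nearest neighbor in H."""
    return min(sum((a - b) ** 2 for a, b in zip(x, y)) for y in H)

def nbnn_classify(G, class_sets):
    """Assign the test set G to the class c minimizing
    sum_{x in G} ||x - NN_{H_c}(x)||^2, following (7.8) with (7.6).

    class_sets maps each class label to its merged point set H_c.
    """
    scores = {c: sum(sq_nn_dist(x, H) for x in G) for c, H in class_sets.items()}
    return min(scores, key=scores.get)

# Hypothetical toy example with two classes of 2D points.
class_sets = {
    "a": [[0.0, 0.0], [1.0, 0.0]],
    "b": [[10.0, 10.0], [11.0, 10.0]],
}
label = nbnn_classify([[0.2, 0.1], [0.9, 0.0]], class_sets)
```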

More importantly, NBNN discards the training phase and makes the prediction cost of (7.8)
only proportional to the number of classes, as $O(N C \zeta_{H_c})$, where $N$ is the size of the
test set and $\zeta_{H_c}$ denotes the complexity of one NN search in $H_c$. Local NBNN (LNBNN)
[112] further improves NBNN by merging all classes into one large set $H = \bigcup_c H_c$, and
decreases the complexity to $O(N \zeta_H)$. As a result, LNBNN can classify many classes very
efficiently. Interested readers are encouraged to see [112] for more details. Yet in large,
high-dimensional problems, $\{H_c\}$ and $H$ can easily become huge, making $\zeta_{H_c}$ and
$\zeta_H$ unaffordable. Another problem is that $H$ might become so large that it cannot be held in
memory, making even the approximate NN search methods infeasible.

7.2.3  Computational  Issues 

Looking at the above algorithms, we realize that one would face severe challenges in both
computational time and space if one were to apply these algorithms to large-scale problems. These
computational requirements are determined by the sizes of the sets. If we could reduce them by half,
the space complexity would drop by half, and ideally the running time would drop to only a quarter.
Therefore, our approach to making the “learning on sets” problem more efficient is to directly
reduce the size of the sets, condensing the information into a much smaller amount of data while
preserving the learning performance.

Even though NBNN and LNBNN have successfully made the complexity linear w.r.t. the number
of sets, they create huge point sets $\{H_c\}$ for each class that are difficult to store and use.
To put it in context, suppose we have 1,000 images for training. Typically each image is
characterized by around 2,000 densely sampled 128-dimensional single precision SIFT vectors. Using
Local NBNN, this relatively small training set would result in a model consisting of
$2 \times 10^6$ points and 1GB of data. Additionally, in high dimensions searching for nearest
neighbors in such a large set can be very slow.
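The model size quoted above follows from simple arithmetic, assuming exactly 2,000 vectors per image and 4 bytes per single precision value:

```latex
\begin{align*}
\text{points} &= 1{,}000 \times 2{,}000 = 2 \times 10^{6}, \\
\text{bytes}  &= 2 \times 10^{6} \times 128 \times 4 = 1.024 \times 10^{9} \approx 1\ \text{GB}.
\end{align*}
```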

In practice one could also consider using approximate NN search algorithms. One popular
method, for example, is the randomized KDTree [154] algorithm implemented in the FLANN¹
[121] package. It checks a fixed number of leaf nodes across multiple KDTrees, a budget set by the
user, and then returns the best results it can find. Its exact approximation accuracy and time
complexity are unclear and depend on the data. When using such approximate methods, it is rare to
achieve a quadratic speed-up by reducing the size, yet condensing can still greatly help the
construction, use, and storage of models. When the data are too large to fit into memory, even
approximate search is infeasible, but condensing can make it possible.


7.3  Related  Work 

As far as we know, there is little previous work that thoroughly studies how to reduce the sets'
sizes in set-based learning. One reason might be that set-based learning itself is rather new.
Random sampling is a common practice. In [162], the authors used an asymmetric approach for image
classification. The reasoning is that we can find good matching patches between two similar images
as long as one of them is densely sampled. This approach effectively subsamples the set on
one side of the similarity/divergence computation. It can speed up the computation but will also
degrade the classification performance. In the following sections we show that by using more
carefully chosen condensing methods, we can improve both the speed and the accuracy of their
algorithm. Recently, the kernel herding [29] algorithm was proposed to accelerate MMK (7.1) by
selecting a small number of points to represent the group. Essentially, herding lets the selected
points approximate the original distribution of the group. It is similar to the k-Means condenser
described in the next section, but it is much more complex and less practical in large data
scenarios.

In point-based learning, condensing point sets is embodied in the prototype selection (PS)
problem [53]. In prototype selection, we are given one training set of labeled points, and the goal
is to reduce the size of the training set while maximizing the performance of the resulting
classifier. We can see that PS is very different from condensing in the context of set-based
learning. PS handles one set and focuses on the behavior of individual points. In contrast,
set-based learning handles multiple sets and focuses on the set as a whole. PS methods usually focus
on discriminating the points instead of preserving the statistical properties of the sets that are
needed in set-based learning. Several underlying techniques are shared between PS and condensing in
set-based learning. Below we will comprehensively evaluate the techniques that are suitable for
set-based learning. Our contribution is the discovery of a good condensing method for set-based
learning algorithms that can improve their speed, space requirement, and accuracy all at the same
time.

¹ http://www.cs.ubc.ca/~mariusm/index.php/FLANN


7.4  Condensing  Methods 

We examine three potential ways of condensing a point set $G$ of size $N$ to a smaller point set
$\tilde{G}$ of size $K$, so that the properties of $G$ that are useful to set learning are preserved
in $\tilde{G}$. These three methods are selected for the following reasons:

•  They  have  sound  theoretical  bases. 

•  They  are  easy  and  efficient  to  use  in  large-scale  problems. 

•  They  are  universal  i.e.  not  coupled  with  the  subsequent  learning  algorithms,  and  thus  widely 
applicable. 


[Figure 7.1 panels: (a) All points; (b) Random sampling; (c) Uniform covering; (d) k-Means]

Figure 7.1: Condensing results of a 2D standard Gaussian point set using different condensers.


7.4.1  Random  Sampling 

Statistically, random subsampling can create smaller point sets $\tilde{G}$ with the same
statistical properties as the original set $G$. This is the common method used to make trade-offs
between speed and accuracy [133], and it is essentially the asymmetric approach used in [162]. It is
easy to implement with little computational cost. However, when given only one chance to subsample,
the result is random with possibly high variance, and might lead to poor results. Figure 7.1b shows
an example in which the subsampled set is far from an ideal representation of the original set. We
will use this method as our baseline.
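A minimal sketch of this baseline, assuming uniform sampling without replacement; the function name `random_condense` and the `seed` parameter are illustrative choices:

```python
import random

def random_condense(G, K, seed=None):
    """Condense G to at most K points by uniform sampling without replacement."""
    if K >= len(G):
        return list(G)
    rng = random.Random(seed)
    return rng.sample(G, K)

# Hypothetical example: keep 10 of 100 one-dimensional points.
G = [[float(i)] for i in range(100)]
condensed = random_condense(G, 10, seed=0)
```

Running it with different seeds gives different subsets, which is the source of the variance discussed above.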

7.4.2  Uniform  Covering 

We can also control the condensing quality at the point level. Since many learning algorithms
are based on NNs, we require that, for any point $x$, the change in the distance to its NN is
bounded individually when $G$ is replaced by its condensed version. Formally, we are looking for a
$\tilde{G}$ such that

$$\left| \|x - \mathrm{NN}_{\tilde{G}}(x)\|_2 - \|x - \mathrm{NN}_{G}(x)\|_2 \right| < r, \quad \forall x,$$

where $r$ is the error threshold.

Finding such a $\tilde{G}$ with minimal size is NP-hard [11]. [159] proposed kernel vector
quantization, which uses an $\ell_1$ relaxation and provides an approximate solution using linear
programming, but it can be too slow when condensing large sets. Instead, we adopt the following
simple solution. Starting from an empty $\tilde{G}$: a) pick a point $x$ from $G$ and move $x$ into
$\tilde{G}$; b) remove the points in $G$ whose distance to $x$ is less than $r$; repeat a) and b)
until $G$ is empty. The points in $\tilde{G}$ form the centers of radius-$r$ spheres that uniformly
cover the support of $G$'s underlying distribution. Therefore, we call this uniform covering
condensing. An illustration is given in Figure 7.1c.
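The greedy procedure in steps a) and b) can be sketched as below. The text does not specify how the point in step a) is picked, so this sketch assumes a uniformly random choice; the function name and `seed` parameter are hypothetical.

```python
import math
import random

def uniform_cover_condense(G, r, seed=None):
    """Greedy uniform covering: repeatedly move one remaining point into the
    condensed set and discard every remaining point closer than r to it."""
    rng = random.Random(seed)
    remaining = list(G)
    condensed = []
    while remaining:
        # Step a): pick a point (here at random) and move it to the condensed set.
        x = remaining.pop(rng.randrange(len(remaining)))
        condensed.append(x)
        # Step b): remove the points within distance r of the picked point.
        remaining = [y for y in remaining if math.dist(x, y) >= r]
    return condensed

# Hypothetical example: a 5 x 5 grid of 2D points.
grid = [[float(i), float(j)] for i in range(5) for j in range(5)]
centers = uniform_cover_condense(grid, 1.5, seed=0)
```

By construction every original point lies within distance $r$ of some center and the centers are at least $r$ apart, but the resulting size is an outcome of $r$ rather than an input.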

To minimize the size of $\tilde{G}$, a heuristic is to pick points in denser regions early so that
more points can be removed. [78] implemented such a heuristic. It runs Mean Shift [32] to find a
local peak of the density, and then picks a point from the peak. The complexity is $O(N^2)$ for a
suitable $r$.

Though it seems well motivated, this approach is flawed. First, we cannot control the number
of points used to cover the support. To find a proper $r$ we would need trial runs and tuning.
Second, it only captures the support of the underlying distribution and ignores the actual density
levels. The most severe problem, however, is the curse of dimensionality. In high dimensions, the
neighborhood of a point becomes so large that we need a very large $r$ to effectively trim the size
down. Sometimes $r$ is not much smaller than the diameter of the support, making the bound useless.
We shall demonstrate this effect in the experiments.

7.4.3  k-Means

As a clustering algorithm, k-Means can also serve our condensing purposes well. We can run
k-Means on the set $G$, and then use the set of cluster centers as $\tilde{G}$. Recall that k-Means
minimizes the following objective:

$$\tilde{G} = \{\tilde{x}_k\}_{k=1,\dots,K} = \arg\min_{\tilde{G}} \sum_{i=1}^{N} \|x_i - \mathrm{NN}_{\tilde{G}}(x_i)\|_2^2. \tag{7.9}$$

In other words, it tries to find a point set $\tilde{G}$ that best reconstructs $G$ using the
nearest neighbor method given the budget of $K$ points.
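A minimal sketch of a k-Means condenser using plain Lloyd iterations with random initialization and a fixed iteration budget. It is an illustration rather than the implementation used in the experiments; the function names are hypothetical, and an empty cluster simply keeps its previous center.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans_condense(G, K, iterations=20, seed=None):
    """Return K cluster centers of G after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(G, K)]  # random initialization
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(K)]
        for x in G:
            nearest = min(range(K), key=lambda c: sq_dist(x, centers[c]))
            clusters[nearest].append(x)
        # Update step: move each center to the mean of its members.
        for k, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centers[k] = [sum(p[d] for p in members) / len(members)
                              for d in range(dim)]
    return centers

def reconstruction_error(G, centers):
    """The objective in (7.9): sum of squared distances to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in G)

# Hypothetical example: two well-separated groups of 2D points.
G = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers = kmeans_condense(G, K=2, iterations=20, seed=0)
```

Unlike uniform covering, the budget `K` is an explicit input, and `reconstruction_error` gives a direct way to compare condensers on the same set.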

We can show that k-Means condensing actually maximizes the performance of NBNN.
Recall that NBNN assumes that sets from the same class $c$ share the same distribution $f_c$ that
is characterized by the merged class representation $H_c$, as described in Section 7.2.2. Then we
assign a test set $G$ to the class with the smallest KL divergence according to (7.8). Hence, to
find a condensed set $\tilde{H}_c$ for $H_c$ that can maximize the classification performance of
class $c$, we should minimize $\mathrm{KL}(G \| \tilde{H}_c)$ for any $G$ with the distribution
$f_c$. Since $f_c$ is characterized by $H_c$, the objective should be

$$\tilde{H}_c = \arg\min_{\tilde{H}_c} \mathrm{KL}(H_c \| \tilde{H}_c). \tag{7.10}$$

We can see that $\tilde{H}_c$ should approximate the distribution of $H_c$. Further, by plugging
(7.6), the KL divergence estimator used by NBNN, into (7.10), we see that (7.10) is indeed
equivalent to the k-Means objective (7.9). In this sense, k-Means is the ideal condenser for NBNN
classifiers. In the experiments we show that it is also generally good for other set learning
algorithms.
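The equivalence can be spelled out as a short derivation. Writing $\tilde{H}_c$ for the condensed version of $H_c$ and substituting the estimator (7.6) with $G_1 = H_c$ and $G_2 = \tilde{H}_c$, the within-set term does not depend on $\tilde{H}_c$, and the positive proportionality factor and additive constant in (7.6) do not affect the minimizer:

```latex
\begin{align*}
\arg\min_{\tilde{H}_c} \mathrm{KL}(H_c \,\|\, \tilde{H}_c)
&= \arg\min_{\tilde{H}_c} \Big[ \sum_{x \in H_c} \|x - \mathrm{NN}_{\tilde{H}_c}(x)\|_2^2
   - \sum_{x \in H_c} \|x - \mathrm{NN}_{H_c}(x)\|_2^2 \Big] \\
&= \arg\min_{\tilde{H}_c} \sum_{x \in H_c} \|x - \mathrm{NN}_{\tilde{H}_c}(x)\|_2^2 ,
\end{align*}
```

which is the objective (7.9) applied to the set $H_c$.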

Another advantage of k-Means is its efficiency. Being an extensively used and studied method,
k-Means can be implemented with highly efficient algorithms for even massive data sets (e.g. [49,
153]). In practice, we also found that exact solutions or even local minima of k-Means are not
required. Usually running k-Means for only tens of iterations is enough to achieve good
condensing performance.

To sum up, k-Means provides a well-justified and efficient way to condense point sets.
Figure 7.1d shows the visually appealing result of k-Means compared to other condensers. In the
experiments, we will also show that it performs surprisingly well in classification tasks.


7.5  Empirical  Evaluation 

In this section we evaluate the performance of the above condensing methods when applied to
various set learning algorithms in different image classification tasks. The three condensing
methods described in Section 7.4 are tested. They are denoted as Rand:K, Unif:K, and KMeans:K
respectively, where K is the condensed size. All is used to denote the original set. We only
evaluate the uniform covering condenser on a small problem to demonstrate its deficiencies. For the
k-Means condenser, we run k-Means once using random initialization for 20 iterations.

Five  image  data  sets  of  different  scales  and  natures  are  used  for  evaluation.  For  classification, 
as  described  in  Section  7.2.2,  we  consider  both  the  set-to-set  scheme  using  the  MMK  (7.1)  and  the 
KL  divergence  based  Gaussian  kernel  (7.7)  (KLK),  and  the  set-to-class  scheme  using  NBNN  [17], 
LNBNN  [112],  and  NPKL  which  is  the  NBNN  classifier  paired  with  the  divergence  estimator 
(7.3). 

For MMK and KLK, we use the condenser to reduce the size of every training and testing
point set separately. For NBNN, LNBNN, and NPKL, the condensers are used to reduce the class
representations $\{H_c\}$. When using KLK with SVM, the resulting kernel matrices are projected to
the closest positive definite matrix as in [131]. For MMK and SVMs, the kernel width $\sigma$ and
slack parameter $C$ are tuned on the training data using 3-fold cross-validation.

MMK, KLK, and Uniform Covering are only applied to small problems because they are slow
compared to other methods. Only LNBNN is used for very large scale problems, since it is the
only one that is fast enough.

We extract multiscale dense SIFT features called PHOW [18] for images. Images are resized
so that the longest side is no larger than 256 pixels. A step size of 10 is used to sample patches
and the patch sizes are [24, 36, 48]. This setting produces about 1,500 128-dimensional points for a
256 × 256 image. We also append the patches' spatial positions in the image to the feature vectors
to enable spatial matching, making the points 130-dimensional. The weight of the coordinates
in the distance calculation is roughly tuned on a small subset and kept the same throughout all the
algorithms and runs.

The VLFeat [164] software is used for k-Means and dense SIFT feature extraction. The FLANN
[121] software is used for NN search. Exact NNs are used for KLK on small scale data sets. In
large experiments of NBNN, LNBNN, and NPKL, we use approximate NNs with four randomized
KDTrees. The number of leaf node checks, which controls the precision of the approximate NN
search, is stated in each experiment and figure. Experiments are done on Opteron K10 2.3 GHz
CPUs.

7.5.1  Scene-15 

The first data set we consider is the Scene-15 data set [92], which is a widely used benchmark for
scene classification. This data set contains 4,485 images from 15 classes. In general a scene image
is characterized by the distribution of features, e.g. the proportion of sky, water, flat surfaces,
etc. This is quite different from images for object recognition that we will present later. For
this data set, the weight of the spatial coordinates is set to 3. In each run, we randomly choose
100 images for training and 100 images for testing unless stated otherwise.

Set-to-Set Classification  To use MMK and KLK for classification, we first calculate the kernel
values between every pair of sets by (7.1) and (7.7), and then give them to SVM and KNN for
classification. For KNN the number of neighbors is tuned based on the training data. Because this
approach is very computationally expensive, we test it only on the first 8 classes, known as the OT
data set [5]. Each image is a set of 1,542 points, and the condensers Rand:100 and KMeans:100
are compared, meaning that we use subsampling and k-Means to reduce the sets' sizes from 1,542
to 100. In each run, we randomly choose 50 images for training and 50 for testing.

The performance of 10 runs is reported in Figure 7.2. We can see that random sampling
significantly decreased the accuracies of the classifiers. However, using the KMeans:100 condensing,
the performance of SVMs is very close to that obtained with the original sets. For MMK, the decrease
in mean accuracy is around 0.8% and the Wilcoxon signed rank test shows a p-value of 0.07.
For KLK, the difference is slightly larger at about 2%. This is a very good performance
considering that only 1/15 of the data is kept. It is interesting to see that KNN's accuracy using
the k-Means condensed sets is significantly better than the uncondensed result. We shall see later
that this is not a random effect. As a reference, we also provide the results based on the distance
between bag-of-words representations (500 visual words) using the same features and classifiers.

The time to compute both MMK and KLK using all the points is about 6,700 CPU×minutes.
After condensing it took about 26 CPU×minutes to compute MMK and 33 CPU×minutes to
compute KLK, which is 200 to 250 times faster. k-Means condensing only took about
0.1 CPU×second for each set.

The Uniform Covering Condenser  Here we test the performance of the uniform covering
condenser paired with LNBNN classification on the full Scene-15 data set. The original $H$ in LNBNN
contains about $2 \times 10^6$ points. Note that with this condenser we cannot specify the size but
only the radius $r$.

As mentioned in Section 7.4, this condenser is problematic in high dimensional spaces. Figure
7.3b shows the relationship between the condensed size and the radius $r$. To reduce the size to
e.g. 3,000, we need $r \approx 400$, which is very large. Either increasing or decreasing $r$ would
result in a dramatic change of size. This phenomenon is a natural result of the curse of
dimensionality. In practice, it is very hard to get a good sense of how large the radius should be
in high dimensions.

Figure 7.2: Accuracies on the OT data set using the original sets and the condensed sets. Green
dashed lines are the accuracies of bag-of-words classifiers.

Figure 7.3a shows the relationship between the accuracy and the radius from 10 runs. Different
numbers of checks in the approximate NN search are tried. We can see that the decrease in accuracy
is unacceptable when the sets are condensed to a reasonable size. In fact, as we will show later, its
performance is even worse than random sampling.

This  experiment  confirms  that  the  uniform  covering  condenser  has  ill-formed  behavior  and  bad 
performance.  Moreover,  it  is  also  slower  than  other  condensers.  Therefore,  we  shall  exclude  it 
from  the  subsequent  experiments. 


The Random and k-Means Condensers  Now we evaluate the sampling and k-Means condensers
with the NBNN, LNBNN, and NPKL classifiers. Again the original classifiers contain about
$2 \times 10^6$ points. 512 checks are used in NN search. Figure 7.4 shows the accuracies from 5
random runs using different condensers and different classifiers. We can see that k-Means condensing
is much better than random sampling. More surprisingly, classifiers using data condensed by
k-Means consistently and significantly outperform those using the original uncondensed data. In
other words, we improved the speed and the accuracy simultaneously, as opposed to making
trade-offs between them. The explanation might be that the condenser removes some of the noisy and
outlier points in the original sets. In Figure 7.4a we can observe that the accuracy decreases a
little when the condensed size is very large. We shall see that this behavior is consistent
throughout most of our experiments.

Figure 7.3: Performance of LNBNN on Scene-15 using the uniform covering condenser. (a) Accuracy
vs. condensing radius; (b) condensed size vs. condensing radius.

We also examine the impact of the number of checks in NN search in Figure 7.5. The
approximate search algorithm performs very well. The impact of the number of checks is minimal, and
the performance usually saturates with 512 checks.

7.5.2  UIUC-Sports

The UIUC-Sports data set [100] contains 1,030 images from 8 sport events. In order to test the
performance in higher dimensions, we use the color version of the PHOW feature, which is
384-dimensional. The image sizes vary so that each one contains 295 to 1,542 points. In each run 70
images are used for training and 60 for testing. As a result, the original classifiers contain about
$6 \times 10^5$ points. The weight of spatial coordinates is 0.6. Other settings remain the same as
in the Scene-15 experiment.

Accuracies of 5 random runs are reported in Figure 7.6. We can observe again that k-Means
condensing is better than both sampling and no condensing. This verifies again the benefit of
removing noise brought by condensing. We also noticed that using all the points, the accuracy of
NPKL is worse than NBNN and LNBNN. But after condensing, these three algorithms perform
almost the same. This shows that k-Means condensing could make the results less sensitive to
the choice of algorithm.

Figure 7.4: Scene-15 classification performances using different classifiers and condensers.
Panels: (a) NBNN; (b) LNBNN; (c) NPKL; each plots accuracy vs. condensed size.

7.5.3  CalTech-101 

The CalTech-101 data set [98] is a standard benchmark for object recognition. This data set
contains 9,144 images of 102 different object classes. Unlike in the Scene-15 and UIUC-Sports data
sets, the class of an object's image is determined more by the presence of a few distinctive local
features (intuitively the object parts) than by the distribution of features.

Figure 7.5: The impact of the number of checks in the NN search on different methods on the
Scene-15 data set.

We follow the standard protocol and use 10, 15, and 30 images per class for training and the rest
for testing. We only test the performance of LNBNN using different condensers as it is the only
classifier that scales well with this problem. We compare the accuracies of the Rand:4000 and
KMeans:4000 condensers to the accuracy without condensing. Without condensing, the classifier
contains $6 \times 10^6$ points taking about 3GB of memory given 30 training images, while the
condensed classifier only takes 202MB of memory irrespective of the number of training images. Note
that we use a larger condensed size of 4,000, expecting that more points are needed to accurately
capture the distinctive features of the objects. The weight of spatial coordinates is 1.5.
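The condensed model size can be checked with a short calculation, assuming each of the 102 classes is condensed to 4,000 points, each point has 130 single precision (4-byte) dimensions, and MB is read as $2^{20}$ bytes:

```latex
\begin{align*}
\text{points} &= 102 \times 4{,}000 = 408{,}000, \\
\text{bytes}  &= 408{,}000 \times 130 \times 4 = 212{,}160{,}000 \approx 202\ \text{MB}.
\end{align*}
```

Because the count depends only on the number of classes and the condensed size, it stays fixed as more training images are added.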

Figure 7.7a shows the performance of 10 random runs. We can see that k-Means is much better
than the random condensing. k-Means slightly outperforms the uncondensed results again, even
though the difference is insignificant. Observe that the improvement of k-Means condensing over
using all the points becomes smaller as more training images are used. This may be because
as more images are added, 4,000 points become insufficient to capture all the information
contained in the training set. The time used for k-Means condensing is 4 CPU×minutes per class
with 30 training images. The prediction takes 100 CPU×minutes using the condensed classifier,
while without condensing it takes 161 CPU×minutes. The acceleration is modest here because
the highly efficient approximate NN search is used. Yet reducing the space requirement by 93% is
still a significant benefit.

Figure 7.6: UIUC-Sports classification performances using different classifiers and condensers.
Panels: (a) NBNN; (b) LNBNN; (c) NPKL.

7.5.4  CalTech-256 

CalTech-256 is an enlarged version of the previous CalTech-101 data set, containing 30,607
images from 257 object classes. The same settings are used as for CalTech-101 except that the weight
of spatial coordinates is 0.6. Note that without condensing, the LNBNN classifier contains
$1.4 \times 10^7$ points taking 7GB of memory, which is approaching the limit of readily available
machines. After condensing, the classifier takes only 500MB of memory, irrespective of the number
of training images.

Figure 7.7b shows the performance of 5 random runs. The behaviors of the condensers are
basically the same as in the CalTech-101 experiment, showing the consistency of the condensers.
Notably, when 30 training images are used, the condenser seems to have reached its limit and
causes a slight decrease of accuracy. It shows that the information carried by the training set is
finally exceeding the capacity of the KMeans:4000 condenser and more points are needed to
maintain performance. The time used for k-Means condensing is again 4 CPU×minutes per class with
30 training images. The prediction takes 440 CPU×minutes using the condensed classifier, while
without condensing it takes 791 CPU×minutes. In this larger problem the condenser's acceleration
effect becomes more prominent, even though the approximate NN search is used.

Figure 7.7: (a) CalTech-101 classification accuracies using LNBNN with different condensers. (b)
CalTech-256 classification accuracies using LNBNN with different condensers. Both panels plot
accuracy vs. number of training images.

7.5.5  ImageNet  Challenge  2012 

The ImageNet Challenge 2012² [41] provides a massive object image classification task, containing 1,261,406 images from 1,000 object classes retrieved by crawling the Internet. The large variations in perspective, object appearance, and background clutter make it an extremely challenging task. Because of the large number of classes and possible ambiguities, 5 guesses are allowed when predicting an image’s label.

We use the dense SIFT features provided by the organizer³, which give about 800 SIFT vectors per image. We apply LNBNN to this classification task with 500 images per class for training and 500 images per class for testing, resulting in an experiment that involves 1 million images. This experiment is too large to be feasible on reasonable machines without condensing; the training set alone would take 140GB of memory.

Rand:2000 and KMeans:2000 condensers are used in this task. The size of the classifier after condensing is about 1GB. Since we are not able to complete the task with all the training points, the result condensed by Rand:20000 is used as a surrogate, which is feasible but already runs very slowly. For the k-Means condensing, we first use random sampling to reduce the input sets’ sizes to $10^5$ and then run k-Means. 256 checks are used in the NN search, and the weight of spatial coordinates is 0.9.

² http://www.image-net.org/challenges/LSVRC/2012/index

³ http://www.image-net.org/download-features

The results from 5 runs of KMeans:2000 and Rand:2000 and 2 runs of Rand:20000 are shown in Table 7.1. We can see that the k-Means condenser performs around 70% better (in relative accuracy) than the sampling condenser using the same amount of data, and also 11% better than the sampling condenser that uses 10 times more data, showing the effectiveness of k-Means condensing in optimizing the classification performance.

The running times for the different condensers are also reported. Note that here “Training” is just the condensing step. We see that even when the approximate NN searcher is used, condensing can still make prediction 6 times faster, and this improvement will become significantly larger if a more accurate NN search is needed. The sampling condenser costs essentially no time except for disk I/O. On the other hand, the k-Means condensing takes less than 4 minutes per class. In large-scale parallel computation, this extra cost is acceptable, and the improvement in prediction speed and accuracy is significant. In all, again, condensing makes the classifier smaller, faster, and more accurate.

Note that our results here are mainly meant to show the effectiveness of the k-Means condenser and are not comparable to the ImageNet Challenge’s top performers. We used the provided features instead of doing feature engineering/learning, and the algorithm used here is extremely simple and efficient.


Condenser        Rand:2000       Rand:20000      KMeans:2000
Accuracy (%)     14.04 ± 0.05    21.18 ± 0.07    23.7 ± 0.55
Training Time    0.07            0.17            3.7
Testing Time     0.52            2.91            0.53

Table 7.1: Accuracies and running times of LNBNN on ImageNet. The training time is measured in CPU*minutes per class and the testing time is measured in CPU*seconds per test image.
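The relative figures quoted in the text can be rechecked from Table 7.1 with a few lines of arithmetic. The sketch below only restates the table values; the helper name `relative_gain` is ours and not part of the original experiments.

```python
# Values copied from Table 7.1 (accuracy in %, testing time in CPU*seconds per image).
accuracy = {"Rand:2000": 14.04, "Rand:20000": 21.18, "KMeans:2000": 23.7}
test_time = {"Rand:2000": 0.52, "Rand:20000": 2.91, "KMeans:2000": 0.53}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


# k-Means condenser versus random sampling with the same number of points.
gain_same_size = relative_gain(accuracy["KMeans:2000"], accuracy["Rand:2000"])
# k-Means condenser versus random sampling with 10 times more points.
gain_vs_10x = relative_gain(accuracy["KMeans:2000"], accuracy["Rand:20000"])
# Prediction speed-up of the condensed classifier over the Rand:20000 surrogate.
speedup = test_time["Rand:20000"] / test_time["KMeans:2000"]

print(f"{gain_same_size:.1%}, {gain_vs_10x:.1%}, {speedup:.1f}x")
```

The three quantities come out at roughly 69%, 12%, and 5.5x, in line with the rounded numbers given in the text.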


7.6  Discussion 

Typically when facing large point sets, a popular approach is to subsample them to make a trade-off between accuracy and speed [133]. Our experiments show that this approach often compromises too much accuracy. However, when we use the k-Means condenser, we can often improve the speed, the space requirement, and the accuracy all at the same time.

Depending on the data set, k-Means condensing can have a different impact on performance. When the sets are mainly characterized by the holistic characteristics of their points, k-Means condensing can not only reduce the size significantly while retaining the information, but may also remove noise and outliers and thereby enhance the accuracy. Examples of such data sets include Scene-15 and UIUC-Sports. If the sets are mainly characterized by a few distinctive points, as in the CalTech data sets, the approximation error on individual points plays a bigger role and condensing is usually less effective. Nonetheless, even on those data sets, we see that k-Means can still at least maintain the accuracy while greatly improving the time and space efficiency.

To use the condensing algorithms, we need to choose the size of the condensed set. A general guideline is to set the budget of time and space and use the largest number of points allowed. Our experience shows that 1,000 to 5,000 points usually work well for set-vs-class classifiers, and 100 to 500 points should work for set-vs-set classifiers. If the purpose is to use condensing to remove the noise and improve the accuracy, then we can use cross-validation to determine the appropriate size.

The cost of k-Means is not trivial but very manageable. The condensing of different sets is independent. In our experiments, we used Elkan’s algorithm [49] in VLFeat [164], which is not as fast as algorithms such as [153] but can still condense $10^5$ points to 2,000 points in less than 4 minutes. In a large-scale parallel computation environment like MapReduce, this is very acceptable. We believe that, given the budget of time and space, it is almost always beneficial to apply k-Means condensing before running a learning algorithm on point sets. Compared to random sampling, it results in much better accuracy within acceptable time.
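As an illustration of the two condensers compared in this chapter, the following is a minimal sketch that condenses one point set either by random sampling or by k-Means. It uses plain Lloyd iterations in NumPy rather than the Elkan/VLFeat implementation used in the experiments, and the function names, iteration count, and seeds are assumptions made for the example.

```python
import numpy as np


def random_condense(points, n_keep, seed=0):
    """Sampling condenser: keep n_keep randomly chosen points."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    idx = rng.choice(len(points), size=min(n_keep, len(points)), replace=False)
    return points[idx]


def kmeans_condense(points, n_centers, n_iter=20, seed=0):
    """k-Means condenser: replace an (N, d) point set by n_centers centroids.

    Plain Lloyd iterations; a stand-in for the faster Elkan algorithm.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    n_centers = min(n_centers, len(points))
    # Initialise the centers with distinct, randomly chosen input points.
    centers = points[rng.choice(len(points), size=n_centers, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((points ** 2).sum(axis=1)[:, None]
              - 2.0 * points @ centers.T
              + (centers ** 2).sum(axis=1)[None, :])
        labels = d2.argmin(axis=1)
        for j in range(n_centers):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster becomes empty
                centers[j] = members.mean(axis=0)
    return centers


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    point_set = rng.standard_normal((5000, 8))       # one synthetic point set
    print(kmeans_condense(point_set, 200).shape)      # (200, 8)
    print(random_condense(point_set, 200).shape)      # (200, 8)
```

Because each set is condensed independently, a call like this can be issued per set (or per class) in parallel.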

For algorithms that need all point-wise similarities or exact NNs in high dimensions, condensing can easily provide a quadratic improvement in speed and a linear improvement in space requirement. In such cases, condensing can turn impossible tasks into feasible ones. When approximate NN search is used and the data set is small, as with Scene-15 and UIUC-Sports, the improvement in speed is not significant. However, when the data set becomes as big as CalTech-256 or even ImageNet, condensing can provide a substantial improvement on top of the approximate NN search.


7.7  Summary 

Efficient algorithms for learning from point sets are important and useful, yet existing methods suffer from high time and space demands. In this work we tried to condense the point sets (reduce the size of the sets) in order to make these methods faster and better. We found that condensing is in general a better strategy than imputing the kernels as described in Chapter 6. In particular, we discovered that the k-Means algorithm does this job very well.

On a wide range of classification methods and image data sets, we evaluated three different practical condensing strategies and found that k-Means is the only one that can successfully reduce the size of point sets without much loss of accuracy. In many cases it even improves the accuracy by removing noise and outliers. This success seems to be universal despite the differences across various classifiers and data sets. We hope our discovery can help practitioners adopt point-set based methods in large-scale problems.

Chapter 8

Sampling  Bias  Correction  by  Conditional 
Divergences 


Many objects can be represented as sets of multi-dimensional points. A common approach to learning from these point sets is to assume that each set is an i.i.d. sample from an unknown underlying distribution, and then estimate the similarities between these distributions. In realistic situations, however, the point sets are often subject to sampling biases. These biases can fundamentally change the distributions and distort the results of estimation and learning. In this work we propose to use conditional divergences to correct these distortions and learn from biased point sets effectively. Our empirical study shows that the proposed method can successfully correct the biases and achieve satisfactory learning performance.


8.1  Introduction 

Traditional learning algorithms deal with vectors/points, but many real objects are actually sets of points that are multi-dimensional, real-valued vectors. For instance, in monitoring problems, each sensor produces one set of measurements for a particular region within a time period. A traditional way to deal with point sets is to construct feature vectors for the sets through discretization so that standard learning techniques can be applied. However, this conversion process often relies on human effort and is prone to information loss. Recently, several algorithms were proposed to learn directly from point sets based on the assumption that each point set is a sample from an unknown distribution, including methods proposed in this thesis. [130, 131] proposed novel kernels between point sets based on efficient and consistent divergence estimators. [65, 120] took a similar approach and designed a kernel based on the kernel embedding of distributions. [17, 112] developed extremely simple classifiers for point sets based on divergences between sets and classes. These methods achieved impressive empirical successes, showing the advantage of learning directly from point sets.

One factor that can significantly affect the effectiveness of learning is sampling bias. Sampling bias comes from the way we collect points from the underlying distributions, and makes the observed sample unrepresentative of the true distribution. It undermines the fundamental validity of learning because the points are no longer i.i.d. samples from a distribution conditioned only on the object’s type. Though it has been extensively studied in statistics, this key problem has been largely ignored by previous research on learning from sets. The goal of this chapter is to alleviate the impact of sampling bias when measuring similarities between point sets.

We consider point sets with the following structure. Let each point be described by two groups of random variables: the independent variables (i.v.) and the dependent variables (d.v.). A point is collected by first specifying the value of the i.v., and then observing a sample from the distribution of the d.v. conditioned on the given i.v. Figure 8.1 shows a synthetic example where the i.v. is sampled uniformly, and the d.v. is drawn from a Gaussian distribution whose mean is proportional to the value of the i.v., forming the black line-shaped point set. Many real-world situations, including surveys and mobile sensing, produce point sets of this type. In patch-based image analysis, we first specify the location of the patches as the i.v. and then extract their features as the d.v. In traffic monitoring, a helicopter is sent to specific locations at specific times (i.v.) and measures the traffic volume (d.v.).


Figure 8.1: The observation biases. Panels (left to right): Unbiased, Biased, Biased; the horizontal axis of each panel is the i.v.

We assume that the sampling bias affects the way we observe the i.v., yet the observation of the d.v. given the i.v. remains intact. This assumption is compatible with the covariate shift model [73, 152]. As shown in Figure 8.1, an unbiased observer will sample the i.v. uniformly and get the black set. Biased observers might focus more on the smaller or larger values of the i.v. and create the biased red and blue sets, where the curves show the observed marginal densities of the i.v. The joint and marginal distributions of the biased sets now look very different from each other and from the unbiased set. Nevertheless, no matter what the distribution of the i.v. is, the distribution of the d.v. given the i.v. is always the same Gaussian, which does not change with the observer. In traffic monitoring, the helicopter may be tasked with other, non-traffic, jobs that create different patrol schedules each day, thus creating an uneven profile of the city’s traffic. But the measured traffic volumes at the patrolled locations are still accurate.
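To make this setting concrete, here is a small sketch that generates point sets in the spirit of Figure 8.1: the d.v. is Gaussian with mean proportional to the i.v., and only the way the i.v. is sampled changes between observers. The Beta distributions used for the biased observers, the noise level, and all names are assumptions chosen for illustration, not the settings behind the original figure.

```python
import numpy as np


def sample_set(n, iv_sampler, rng, slope=1.0, noise=0.1):
    """Draw n points [x, y]: x from iv_sampler, then y ~ N(slope * x, noise^2)."""
    x = iv_sampler(rng, n)
    y = slope * x + noise * rng.standard_normal(n)
    return np.column_stack([x, y])


rng = np.random.default_rng(0)
# Unbiased observer: i.v. sampled uniformly on [0, 1].
unbiased = sample_set(2000, lambda r, n: r.uniform(0.0, 1.0, n), rng)
# Biased observers: i.v. concentrated on small or large values.
biased_low = sample_set(2000, lambda r, n: r.beta(1.0, 3.0, n), rng)
biased_high = sample_set(2000, lambda r, n: r.beta(3.0, 1.0, n), rng)

# The marginal of the i.v. differs strongly between observers ...
iv_means = [s[:, 0].mean() for s in (unbiased, biased_low, biased_high)]
# ... while the conditional relation (slope of y against x) stays the same.
fitted_slopes = [np.polyfit(s[:, 0], s[:, 1], 1)[0]
                 for s in (unbiased, biased_low, biased_high)]
print(iv_means, fitted_slopes)
```

The fitted slopes agree across the three sets even though their i.v. marginals differ, which is the property the conditional divergence relies on.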
To correct sampling biases of this kind, we propose to use conditional divergences. Existing divergence-based methods use the joint distribution of the i.v. and the d.v. to measure the differences between point sets. On the other hand, conditional divergences focus on the conditional distributions of the d.v. given the i.v. and are insensitive to the distribution of the i.v., which is distorted by the sampling bias in our setting. As long as the conditional distributions are intact, the conditional divergences will be reliable. Moreover, it can be shown that the divergence between joint distributions is a special case of the conditional divergence. A fast and consistent estimator is developed for the conditional divergences. We also discuss specific examples of correcting sampling biases, including some extreme cases.

We  evaluate  the  effectiveness  of  conditional  divergences  on  both  synthetic  and  real  world  data 
sets.  On  synthetic  data  sets,  we  show  that  the  proposed  estimator  is  accurate  and  the  conditional 
divergences  are  capable  of  correcting  sampling  biases.  We  also  demonstrate  their  performance  on 
real-world  climate  and  image  classification  problems. 

The rest of this chapter is organized as follows. The background and some related work are introduced in Section 8.2. Section 8.3 defines the conditional divergence and describes its properties and estimation. Section 8.4 describes how to use the conditional divergence to correct various sampling biases. Section 8.5 discusses the conditional divergences further. In Section 8.6, we evaluate the effectiveness of the proposed methods on both synthetic and real data sets. We conclude the chapter in Section 8.7.

8.2  Background  and  Related  Work 

We consider a data set that consists of $M$ point sets $\{G_m\}_{m=1,\dots,M}$, where each point set $G_m$ is a set of $d$-dimensional vectors, $G_m = \{z_{mn}\}_{n=1,\dots,N_m}$, $z_{mn} \in \mathbb{R}^d$. Each point $z_{mn} = [x_{mn}; y_{mn}]$ is a concatenation of two shorter vectors $x_{mn} \in \mathbb{R}^{d_x}$ and $y_{mn} \in \mathbb{R}^{d_y}$, representing the independent variables (i.v.) and the dependent variables (d.v.) respectively. We assume that each $G_m$ has an underlying distribution $f_m(z) = f_m(x, y)$, and the points $\{z_{mn}\}$ are i.i.d. samples from $f_m(z)$. $f_m$ can be written as $f_m(z) = f_m(y|x) f_m(x)$. In the context of image classification, each $G_m$ is an image, $x_{mn}$ is the location of the $n$th patch, and $y_{mn}$ is the feature of that patch.

We can learn from these sets by estimating the divergence between the $f_m$’s as the dissimilarity between the $G_m$’s. Having the dissimilarities, various problems can be solved by using similarity-based learning algorithms, including k-nearest neighbors (KNN), spectral clustering [122], and support vector machines (SVM). In this direction, several divergence-based methods have been proposed [17, 120, 131], and both empirical and theoretical successes were achieved.

In the presence of sampling bias that affects the distribution of the i.v., $f_m(x)$ is transformed into $f'_m(x)$. Consequently, the observed $G_m$’s represent the biased joint distribution $f'_m(z) = f_m(y|x) f'_m(x)$. Therefore, naively learning from the point sets using joint distributions will lead us to the distorted $f'_m$’s instead of the true $f_m$’s. To correct the sampling bias, we need to either 1) modify the point sets to restore $f(z)$, or 2) use similarity measures that are insensitive to $f(x)$.

Existing correction methods often reweight the points in the training set so that its effective distribution matches the distribution in the test set [34, 73, 152]. In contrast, our proposed conditional divergences are insensitive to the biased distributions of the independent variables and thus robust against sampling biases.

Traditionally in statistics and machine learning, sampling bias is considered between the training set and the test set. In contrast, we consider problems consisting of a large number of point sets, and our goal is to learn from the sets themselves. This extension raises many important challenges, including how to find a common basis to compare all pairs of distributions, how to deal with unobserved segments of distributions, and how to design efficient algorithms.

To our knowledge, this is the first time sampling bias is addressed in the context of learning from sets of points. Algorithms such as [17, 65, 76, 112, 120, 130, 131] all directly compare the joint distributions of the observed points, making them susceptible to sampling bias. [128] proposed the use of conditional divergence, yet sampling bias was still not considered.


8.3  Conditional  Divergences 

We propose to measure the dissimilarity between two distributions $p(z) = p(x, y)$ and $q(z) = q(x, y)$ using the conditional divergence (CD) based on the Kullback-Leibler (KL) divergence:

\[ \mathrm{CD}_{c(x)}\left(p(z)\|q(z)\right) = \mathbb{E}_{c(x)}\left[\mathrm{KL}\left(p(y|x)\|q(y|x)\right)\right] \tag{8.1} \]


where $c(x)$ is a user-specified distribution over which the expectation is taken. CD is the average KL divergence between the conditional distributions $p(y|x)$ and $q(y|x)$ over possible values of $x$, and $c(x)$ can be considered as the importance of the divergences at different $x$’s. CD’s definition is free of the i.v. distributions $p(x)$ and $q(x)$, which are vulnerable to sampling biases. By definition, CD has a lot in common with the KL divergence: it is non-negative, and equals zero if and only if $p(y|x) = q(y|x)$ for every $x$ within the support of $c(x)$. Like the KL divergence, CD is not a metric and not even symmetric.
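A small worked case may help to see what definition (8.1) computes. Assume, purely for illustration, that both conditionals are Gaussian with the same variance, $p(y|x) = N(a x, \sigma^2)$ and $q(y|x) = N(b x, \sigma^2)$, so that the inner KL divergence has the closed form $(a-b)^2 x^2 / (2\sigma^2)$. The sketch below averages this quantity over two different choices of $c(x)$ on $[0, 1]$; the function names, slopes, and grid are assumptions of the example.

```python
import numpy as np


def kl_gauss_same_var(mu_p, mu_q, sigma):
    """KL(N(mu_p, sigma^2) || N(mu_q, sigma^2)) for equal variances."""
    return (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)


def conditional_divergence(c_density, slope_p, slope_q, sigma, grid):
    """Approximate E_{c(x)}[KL(p(y|x) || q(y|x))] on a uniform grid of x values."""
    kl = kl_gauss_same_var(slope_p * grid, slope_q * grid, sigma)
    weights = c_density(grid)
    weights = weights / weights.sum()  # normalise c(x) on the grid
    return float((weights * kl).sum())


grid = np.linspace(0.0, 1.0, 10001)
# c(x) uniform on [0, 1]: CD = (a - b)^2 E[x^2] / (2 sigma^2) = 2 * (1/3).
cd_uniform = conditional_divergence(lambda x: np.ones_like(x), 1.0, 2.0, 0.5, grid)
# c(x) = 2 (1 - x), emphasising small x: E[x^2] = 1/6, so CD = 1/3.
cd_small_x = conditional_divergence(lambda x: 2.0 * (1.0 - x), 1.0, 2.0, 0.5, grid)
print(cd_uniform, cd_small_x)
```

Note that neither result depends on how $x$ was distributed in the observed sets; only the choice of $c(x)$ and the two conditionals enter.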

In the form of (8.1), CD is hard to compute because the divergences $\mathrm{KL}\left(p(y|x)\|q(y|x)\right)$ are not available for arbitrary continuous distributions. Also note that $c(x)$ is a distribution specified by the user. To make CD more accessible, we can rewrite it as

\[ \mathrm{CD}_{c(x)}\left(p(z)\|q(z)\right) = \mathbb{E}_{p(z)}\left[\frac{c(x)}{p(x)}\left(\ln\frac{p(z)}{q(z)} - \ln\frac{p(x)}{q(x)}\right)\right]. \tag{8.2} \]

Now, CD is defined in terms of the density ratios of the input distributions and the expectation over $p(z)$.

An interesting case of (8.2) occurs when we choose $c(x) = p(x)$, which gives the result

\[ \mathrm{CD}_{p(x)}\left(p(z)\|q(z)\right) = \mathrm{KL}\left(p(z)\|q(z)\right) - \mathrm{KL}\left(p(x)\|q(x)\right). \tag{8.3} \]

We can see that this special CD is equal to the joint divergence (the divergence between joint distributions) minus the divergence between the marginal distributions of $x$. Intuitively, CD removes the effect of $p(x)$ and $q(x)$ from the joint divergence, so that the net results are free from the sampling bias. Moreover, when $p(x)$ and $q(x)$ are the same, $\mathrm{KL}(p(x)\|q(x))$ vanishes and this CD equals the joint divergence. In other words, when there is no sampling bias, $\mathrm{CD}_{p(x)}\left(p(z)\|q(z)\right) = \mathrm{KL}\left(p(z)\|q(z)\right)$.
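For completeness, the identity (8.3) can be checked with the chain rule for the KL divergence. The short derivation below is a sketch that only uses the factorizations $p(z) = p(x)\,p(y|x)$ and $q(z) = q(x)\,q(y|x)$ together with definition (8.1).

```latex
\begin{align*}
\mathrm{KL}\left(p(z)\|q(z)\right)
  &= \int p(x,y)\,\ln\frac{p(x)\,p(y|x)}{q(x)\,q(y|x)}\,dx\,dy \\
  &= \int p(x)\,\ln\frac{p(x)}{q(x)}\,dx
     + \int p(x)\int p(y|x)\,\ln\frac{p(y|x)}{q(y|x)}\,dy\,dx \\
  &= \mathrm{KL}\left(p(x)\|q(x)\right)
     + \mathbb{E}_{p(x)}\left[\mathrm{KL}\left(p(y|x)\|q(y|x)\right)\right].
\end{align*}
```

The last expectation is exactly $\mathrm{CD}_{p(x)}\left(p(z)\|q(z)\right)$, so moving $\mathrm{KL}\left(p(x)\|q(x)\right)$ to the left-hand side gives (8.3).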
8.3.1 Estimation


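In this section we give an estimator for CD (8.2). Suppose we have two sets $G_p$ and $G_q$ with underlying distributions $p(z)$ and $q(z)$ respectively. We can approximate the expectation (8.2) with the empirical mean and estimated densities:

\[ \widehat{\mathrm{CD}}_{c(x)}\left(p(z)\|q(z)\right) = \frac{1}{N_p}\sum_{n=1}^{N_p}\frac{c(x_{p,n})}{\hat{p}(x_{p,n})}\left(\ln\frac{\hat{p}(z_{p,n})}{\hat{q}(z_{p,n})} - \ln\frac{\hat{p}(x_{p,n})}{\hat{q}(x_{p,n})}\right), \tag{8.4} \]

where $N_p$ is the size of $G_p$, and $\hat{p}$, $\hat{q}$ are the estimates of $p$, $q$.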

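$c(x)$ is an arbitrary input from the user, and we can see that its role is to reweight the log-density-ratios at different points in $G_p$. To generalize this notion, we define the generalized conditional divergence (GCD) and its estimator as the weighted average of the log-density-ratios:

\[ \mathrm{GCD}_{w}\left(p(z)\|q(z)\right) = \sum_{n=1}^{N_p} w(x_{p,n})\left(\ln\frac{p(z_{p,n})}{q(z_{p,n})} - \ln\frac{p(x_{p,n})}{q(x_{p,n})}\right), \tag{8.5} \]

\[ \widehat{\mathrm{GCD}}_{w}\left(p(z)\|q(z)\right) = \sum_{n=1}^{N_p} w(x_{p,n})\left(\ln\frac{\hat{p}(z_{p,n})}{\hat{q}(z_{p,n})} - \ln\frac{\hat{p}(x_{p,n})}{\hat{q}(x_{p,n})}\right), \tag{8.6} \]

\[ \sum_{n=1}^{N_p} w(x_{p,n}) = 1, \quad w(x_{p,n}) > 0, \]

where $w(x)$ is the weight function and the constraint $\sum_n w(x_n) = 1$ is induced by the fact that

\[ \lim_{N_p\to\infty}\sum_{n=1}^{N_p} w(x_{p,n}) = \lim_{N_p\to\infty}\frac{1}{N_p}\sum_{n=1}^{N_p}\frac{c(x_{p,n})}{p(x_{p,n})} = \mathbb{E}_{p(x)}\left[\frac{c(x)}{p(x)}\right] = \int \frac{c(x)}{p(x)}\,p(x)\,dx = 1. \]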


To obtain the density estimates $\hat{p}$, $\hat{q}$, we use the k-nearest-neighbor (KNN) based estimator [105]. Let $f(z)$ be the $d$-dimensional density function to be estimated and $Z = \{z_n\}_{n=1,\dots,N}$ be samples from $f(z)$. Then the density estimate at the point $z'$ is

\[ \hat{f}(z') = \frac{k}{N\,c_1(d)\,\phi_{Z,k}^{d}(z')}, \tag{8.7} \]

where $c_1(d)$ is the volume of the unit ball in the $d$-dimensional space, and $\phi_{Z,k}(z')$ denotes the distance from $z'$ to its $k$th nearest neighbor in $Z$ (if $z'$ is already in $Z$ then it is excluded). This estimator is chosen over other options such as kernel density estimation because it is simple, fast, and leads to a provably convergent estimator as shown below.
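The density estimator (8.7) is short enough to sketch directly. The version below uses brute-force Euclidean distances in NumPy instead of a space-partitioning tree, so it is only suitable for small samples; the function names and the `exclude_self` flag are assumptions of this sketch.

```python
import math

import numpy as np


def unit_ball_volume(d):
    """Volume c_1(d) of the d-dimensional Euclidean unit ball."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)


def knn_distance(queries, sample, k, exclude_self=False):
    """Distance from each query row to its k-th nearest neighbour in `sample`.

    Both inputs are 2-D arrays of shape (n_points, d). With exclude_self=True
    the queries are assumed to be the rows of `sample` itself, and each point's
    zero distance to itself is skipped.
    """
    diff = queries[:, None, :] - sample[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    dists.sort(axis=1)
    return dists[:, k] if exclude_self else dists[:, k - 1]


def knn_density(queries, sample, k):
    """KNN density estimate (8.7) at each query point."""
    n, d = sample.shape
    phi = knn_distance(queries, sample, k)
    return k / (n * unit_ball_volume(d) * phi ** d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.standard_normal((20000, 1))
    # Estimated density of a standard normal at 0 (true value is about 0.399).
    print(knn_density(np.array([[0.0]]), sample, k=200))
```

For large sets, the brute-force distance matrix would be replaced by a KD-tree or an approximate nearest neighbor search.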

By plugging (8.7) into (8.6), we get the following estimator for GCD:

\[ \widehat{\mathrm{GCD}}_{w}\left(p(z)\|q(z)\right) = \sum_{n=1}^{N_p} w(x_{p,n})\left(d\,\ln\frac{\phi_{G_q,k}(z_{p,n})}{\phi_{G_p,k}(z_{p,n})} - d_x\,\ln\frac{\phi_{G_q,k}(x_{p,n})}{\phi_{G_p,k}(x_{p,n})}\right), \tag{8.8} \]

where $d_x$ is the dimensionality of $x$. We can see that the resulting estimator has a simple form and can be calculated based only on the KNN statistics $\phi$, which are efficient to compute using space-dividing trees or even approximate KNN algorithms such as [121]. Also note that even though the estimator (8.8) is obtained using the density estimator (8.7), its final form only involves simple combinations of the log-KNN-statistics $\ln\phi$. Thus, this GCD estimator effectively avoids explicit density estimation, which is notoriously difficult, especially in high dimensions.
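The estimator (8.8) can be sketched in a few lines once the KNN distances are available. The code below is an illustrative brute-force version with uniform weights by default (the CD-P choice, $c(x) = p(x)$); the function names, the synthetic data, and the sample sizes are assumptions, and duplicate points (zero KNN distance) are not handled.

```python
import numpy as np


def kth_nn_dist(queries, sample, k, skip_self=False):
    """Distance from each query to its k-th nearest neighbour in `sample`."""
    diff = queries[:, None, :] - sample[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    dists.sort(axis=1)
    # When the queries are the sample itself, column 0 is the point itself.
    return dists[:, k] if skip_self else dists[:, k - 1]


def gcd_estimate(Gp, Gq, dx, k=3, weights=None):
    """Sketch of the GCD estimator (8.8).

    Rows of Gp and Gq are points [x, y]; the first dx columns are the i.v.
    """
    Gp = np.asarray(Gp, dtype=float)
    Gq = np.asarray(Gq, dtype=float)
    n_p, d = Gp.shape
    if weights is None:
        weights = np.full(n_p, 1.0 / n_p)  # w proportional to 1, i.e. c(x) = p(x)
    xp, xq = Gp[:, :dx], Gq[:, :dx]
    log_ratio_z = (np.log(kth_nn_dist(Gp, Gq, k))
                   - np.log(kth_nn_dist(Gp, Gp, k, skip_self=True)))
    log_ratio_x = (np.log(kth_nn_dist(xp, xq, k))
                   - np.log(kth_nn_dist(xp, xp, k, skip_self=True)))
    return float(np.sum(weights * (d * log_ratio_z - dx * log_ratio_x)))


rng = np.random.default_rng(0)


def make_set(x, offset):
    """d.v. is Gaussian around x + offset; only the i.v. sampling varies."""
    y = x + offset + 0.3 * rng.standard_normal(len(x))
    return np.column_stack([x, y])


Gp = make_set(rng.uniform(0.0, 1.0, 1000), 0.0)
Gq_same = make_set(rng.beta(2.0, 1.0, 1000), 0.0)  # biased i.v., same conditional
Gq_diff = make_set(rng.beta(2.0, 1.0, 1000), 3.0)  # biased i.v., shifted conditional
cd_same = gcd_estimate(Gp, Gq_same, dx=1)
cd_diff = gcd_estimate(Gp, Gq_diff, dx=1)
print(cd_same, cd_diff)
```

In this toy setting the estimate stays close to zero when only the i.v. distribution differs, and becomes large when the conditional distribution itself is shifted.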

More importantly, the GCD estimator (8.8) has stronger convergence properties than the density estimator from which it is derived. Standard convergence results state that the density estimator (8.7) is statistically consistent only if $k/N \to 0$ and $k \to \infty$ simultaneously. However, for the estimator (8.8) convergence can be achieved even for a fixed finite $k$. This means that we can always use a small $k$ to keep the nearest neighbor search fast and still get good estimates. Specifically, following the work of [129, 168], the following theorem can be proved:

Theorem 3. Suppose the density function pairs $(p(z), q(z))$ and $(p(x), q(x))$ are both 2-regular (as defined in [168]). Also suppose that the weight function satisfies $\lim_{N_p\to\infty} w(x_{p,n}) = 0, \forall n$. Then the estimator (8.8) is $L_2$ consistent for any fixed $k$. That is,

\[ \lim_{N_p, N_q\to\infty} \mathbb{E}\left[\left(\widehat{\mathrm{GCD}}_{w}\left(p(z)\|q(z)\right) - \mathrm{GCD}_{w}\left(p(z)\|q(z)\right)\right)^2\right] = 0. \tag{8.9} \]

The proof of Theorem 3 is similar to the one used in [168]. The condition $\lim_{N_p\to\infty} w(x_{p,n}) = 0$ ensures that the weight function does not concentrate on only a few points. We omit the detailed proof here. Note that the convergence of GCD does not carry over to CD (8.4), because the weight function $w(x_{p,n}) = \frac{c(x_{p,n})}{N_p\,\hat{p}(x_{p,n})}$ is no longer deterministic. However, empirically we found that (8.4) exhibits the behavior of a consistent estimator and produces satisfactory results.


8.4  Choosing  c(x) 

To  use  CD,  we  have  to  choose  the  appropriate  c(x)  or  w(x).  When  learning  from  point  sets,  it  is 
preferable  to  use  the  same  c(x)  to  compute  the  CDs  between  all  pairs  of  sets,  so  that  they  have  a 
common  basis  to  compare.  However,  this  is  not  always  necessary  or  possible.  Even  though  the 
choice  of  c(x)  and  w(x)  can  be  arbitrary,  we  consider  3  options  below. 

First, we can let $c(x) \propto 1$ so that $w(x_{p,n}) \propto p^{-1}(x_{p,n})$, which treats every value of $x$ equally. The disadvantage is that $p^{-1}(x_{p,n})$ has to be estimated, which is error prone. Second, we can use $c(x) = p(x)$ and $w(x_{p,n}) \propto 1$, leading to (8.3). In this case, different pairs of sets can have different $c(x)$’s. When the sampling bias is small, these differences might be acceptable considering the possible errors in $w(x)$ otherwise. Third, $c(x) \propto p(x)q(x)$ and $w(x_{p,n}) \propto q(x_{p,n})$ puts the focus on regions where both $p(x)$ and $q(x)$ are high. It means that we should put larger weights in dense regions and avoid sparse regions to get reliable estimates.
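The three options translate into three ways of computing the normalized weights $w(x_{p,n})$. The following sketch estimates the required i.v. densities with a brute-force KNN estimator; the option labels "U", "P", and "PQ", the default $k = 30$, and the function names are assumptions made for this example.

```python
import math

import numpy as np


def knn_density(queries, sample, k, skip_self=False):
    """KNN density estimate at the query points; inputs have shape (n, d)."""
    n, d = sample.shape
    diff = queries[:, None, :] - sample[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    dists.sort(axis=1)
    phi = dists[:, k] if skip_self else dists[:, k - 1]
    n_eff = n - 1 if skip_self else n  # a query inside the sample is excluded
    ball = math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)
    return k / (n_eff * ball * phi ** d)


def cd_weights(xp, xq, option, k=30):
    """Normalised weights w(x_{p,n}) on the i.v. of G_p for three choices of c(x)."""
    if option == "U":       # c(x) proportional to 1: w proportional to 1 / p(x)
        w = 1.0 / knn_density(xp, xp, k, skip_self=True)
    elif option == "P":     # c(x) = p(x): uniform weights
        w = np.ones(len(xp))
    elif option == "PQ":    # c(x) proportional to p(x) q(x): w proportional to q(x)
        w = knn_density(xp, xq, k)
    else:
        raise ValueError("option must be 'U', 'P', or 'PQ'")
    return w / w.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xp = rng.uniform(0.0, 1.0, (500, 1))   # i.v. of G_p
    xq = rng.beta(2.0, 1.0, (500, 1))      # i.v. of G_q, biased toward large x
    for option in ("U", "P", "PQ"):
        print(option, cd_weights(xp, xq, option)[:3])
```

The resulting vector can be passed as the weights of a GCD estimator; the density estimate used here should use a different $k$ than the one used for the log-density-ratios.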

One caveat is that the weight function and the log-density-ratios in CD should not use the same density estimate, otherwise the estimation errors will correlate and cause systematic over-estimations. Using different estimators can help decouple the errors and avoid accumulation. In practice, we use the estimator (8.7) with a different $k$.

Some extreme cases of sampling bias arise when whole segments of the distribution are missing from the sample and therefore unobserved. Two sets can even have disjoint supports of $x$. With the CD, we can choose $c(x) \propto p(x)q(x)$ or $c(x) \propto I(p(x)q(x) > 0)$, where $I(\cdot)$ is the indicator function, and only compare two sets in their overlapping regions. The result may not be accurate with respect to the true divergence, but it is still a valid measurement of the differences between conditional distributions. When $f(y|x)$ only weakly depends on $x$, this estimate can be an adequate approximation to the original divergence. If $f(y|x)$ varies drastically for different $x$’s without any regularity, then only comparing the overlapping regions might be the best we can do.

When  two  sets  have  disjoint  supports  in  x,  no  useful  information  can  be  extracted  and  the 
corresponding  divergence  has  to  be  regarded  as  missing  without  further  assumptions.  Nevertheless, 
in  our  settings  where  a  large  number  of  point  sets  are  available,  it  is  likely  that  each  set  will  share 
its  support  in  x  with  at  least  some  others  to  provide  a  few  reliable  divergence  estimates.  We  might 
be  able  to  infer  the  divergence  between  disjoint  sets  using  the  idea  of  triangulation.  We  shall  leave 
this  possibility  for  future  investigation. 


8.5  Discussion 

In CD, $c(x)$ conveys prior knowledge about the importance of different $x$’s. It should be carefully chosen based on the data, and poor results can happen when the assumptions made in $c(x)$ are not valid. For example, $c(x) \propto 1$ assumes that all the $x$’s are equally important. This could be a bad assumption when the supports of two sets do not overlap, because at some $x$’s one of the densities will be zero, making the conditional densities $f(y|x)$ not well-defined. Similar problems might occur in regions where one of the densities is very low. Numerically the estimator can still work but usually produces poor results. In this scenario, $c(x) \propto p(x)q(x)$ suits the data better.

The GCD estimator (8.8) relies on the KNN statistics, which are the distances between nearest neighbors. Usually we use the Euclidean distance to measure the difference between points and find nearest neighbors. However, the estimator does not prevent the use of other distances. In fact, [105] shows that alternative distances can be used and the consistency results will generally still hold. A common choice of adaptive distance measure is the Mahalanobis distance [156], which is equivalent to applying a linear transformation to the random variables. It is even possible to learn the distance metric for $\phi$ in a supervised way to maximize the learning performance. We leave this possibility as future work.

The estimated conditional divergences can be used in many learning algorithms to accomplish
various tasks. In this chapter, we use kernel machines to classify point sets as in [130, 131]. Having
the divergence estimates, we convert them into Gaussian kernels and then use an SVM for
classification. When constructing kernels, all the divergences are symmetrized by taking the average
$\mu(p, q) = \frac{d(p\|q) + d(q\|p)}{2}$. The symmetrized divergences $\mu$ are then exponentiated
to get the Gaussian kernel $k(p, q) = \exp(-\gamma\,\mu(p, q))$ and the kernel matrix $K$, where
$\gamma$ is the width parameter. Unfortunately, $K$ usually does not represent a valid Mercer kernel
because the divergence is not a metric and random estimation errors exist. As a remedy, we discard
the negative eigenvalues from the kernel matrix $K$ to convert it to its closest positive
semi-definite (PSD) matrix $\hat{K}$. This $\hat{K}$ is then a valid kernel matrix and can be used
in an SVM for learning.
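
The kernel construction above (symmetrize, exponentiate, project to PSD) can be sketched as
follows. This is an illustration under assumptions rather than the implementation used in the
experiments: the function name `divergence_to_kernel`, the layout `D[i, j] = d(p_i || p_j)`, and
the toy divergence matrix are hypothetical.

```python
import numpy as np

def divergence_to_kernel(D, gamma):
    """Turn a matrix of estimated divergences into a PSD kernel matrix.

    D[i, j] is assumed to hold the estimate of d(p_i || p_j)."""
    mu = 0.5 * (D + D.T)              # symmetrize by averaging both directions
    K = np.exp(-gamma * mu)           # Gaussian-style kernel with width gamma
    w, V = np.linalg.eigh(K)          # eigendecomposition of the symmetric K
    w = np.clip(w, 0.0, None)         # discard negative eigenvalues
    K_psd = (V * w) @ V.T             # rebuild: closest PSD matrix in Frobenius norm
    return 0.5 * (K_psd + K_psd.T)    # remove tiny floating-point asymmetry

# Toy, noisy, asymmetric "divergence" matrix (hypothetical values).
rng = np.random.default_rng(0)
D = np.abs(rng.normal(size=(6, 6)))
np.fill_diagonal(D, 0.0)
K_hat = divergence_to_kernel(D, gamma=1.0)
```

The resulting `K_hat` can be passed to any SVM implementation that accepts a precomputed kernel
matrix.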

8.6  Experiments


We  examine  the  empirical  properties  of  the  conditional  divergences  and  their  estimators.  The  tested 
divergences  are  listed  below. 

•  Full D: Divergence between the full unbiased sets, used as the ground truth.

•  D: Divergence between biased sets.

•  D-DV: Divergence between biased sets while ignoring the i.v.

•  CD-P, CD-U, CD-PQ: Conditional divergences between biased sets with $c(x) \propto p(x)$,
$c(x) \propto 1$, and $c(x) \propto p(x)q(x)$, respectively.

Full D, D, and D-DV are estimated using the KL divergence estimator proposed by [168]. Unless
stated otherwise, we use k = 3 for GCD estimation using (8.8), and use k values between 30 and
50 to compute the weight function.
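
For orientation, a k-NN estimator of the KL divergence can be sketched as below. We assume the
standard form $\hat{D}(p\|q) = \frac{d}{n}\sum_i \log\frac{\nu_k(i)}{\rho_k(i)} + \log\frac{m}{n-1}$,
where $\rho_k(i)$ and $\nu_k(i)$ are the distances from the i-th point of the first sample to its
k-th nearest neighbor in the first and second sample respectively; the exact variant in [168] may
differ. The function name `knn_kl_divergence` and the toy data are illustrative.

```python
import numpy as np

def knn_kl_divergence(X, Y, k=3):
    """Estimate KL(p || q) from samples X ~ p (n x d) and Y ~ q (m x d)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, d = X.shape
    m = Y.shape[0]
    # Brute-force pairwise Euclidean distances (adequate for small sets).
    dxx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dxy = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # rho: k-th neighbor distance within X; column 0 after sorting is the
    # zero self-distance, so the k-th neighbor sits at column k.
    rho = np.sort(dxx, axis=1)[:, k]
    # nu: k-th neighbor distance from each X point to the sample Y.
    nu = np.sort(dxy, axis=1)[:, k - 1]
    return d / n * np.sum(np.log(nu / rho)) + np.log(m / (n - 1))

# Toy check: two unit-variance 1-D Gaussians whose means differ by 1,
# for which the analytical KL divergence is 0.5.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(1000, 1))
Y = rng.normal(1.0, 1.0, size=(1000, 1))
est = knn_kl_divergence(X, Y, k=3)
```

A tree-based neighbor search would replace the brute-force distance matrices for sets larger than
a few thousand points.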

We consider two types of sampling biases. The first type creates different f(x)’s for different
sets, yet they still share the same support of x as the original unbiased data. Building on the first
type, the second type of sampling bias is more extreme and can hide certain segments of the true
distributions, causing different sets to have different supports of x. We call the test sets
resulting from these two sampling biases uneven sets and partial sets, respectively.

In  order  to  evaluate  the  quality  of  the  bias  correction  by  the  CDs,  we  use  controlled  sampling 
biases  in  our  experiments.  The  original  point  set  data  are  collected  from  real  problems  without  any 
sampling  bias.  Then  we  resample  each  set  to  create  artificial  sampling  biases.  By  doing  this,  we 
can compare the results using the biased sets to the divergences using the unbiased data, which
serve as the ground truth.

An SVM is used to classify the point sets using the method described in Section 8.5. When
using the SVM, we tune the width parameter $\gamma$ and the slack penalty C by 3-fold
cross-validation on the training set.

8.6.1  Synthetic  Data 

Estimation  Accuracy 

We generate synthetic data to test the accuracy of the proposed conditional divergence estimators.
The data set consists of 2-dimensional (one dimension as i.v. and one as d.v.) Gaussian noise along
two horizontal lines as the two classes, as shown in Figures 8.2a and 8.2b. The Gaussians have fixed
spherical covariance, and the mean of the blue class is slightly higher than that of the red class,
resulting in an analytical KL divergence of 0.5. Then the i.v. (x axis) is resampled to create
sampling bias, and the red and blue curves show the resulting marginal densities
$f_{\text{red}}(x)$ and $f_{\text{blue}}(x)$. The task is to recover the true divergence value 0.5
from this biased sample. We vary the sample size to see the empirical convergence, and the results
of 10 random runs are reported. The shortcut for this problem is to ignore the i.v., but we do not
let the estimators take it and force them to recover from the sampling bias.
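
As a worked check on the value 0.5, recall the closed form of the KL divergence between two
Gaussians with a common variance. The symbols $\mu_{\text{blue}}$, $\mu_{\text{red}}$, and $\sigma$
are our notation for the class means and the shared standard deviation of the d.v.; their actual
values are not stated in the text, so this is only a consistency sketch.

```latex
\mathrm{KL}\bigl(\mathcal{N}(\mu_{\text{blue}}, \sigma^2)\,\|\,\mathcal{N}(\mu_{\text{red}}, \sigma^2)\bigr)
  = \frac{(\mu_{\text{blue}} - \mu_{\text{red}})^2}{2\sigma^2},
\qquad\text{so}\qquad
\mathrm{KL} = 0.5 \iff |\mu_{\text{blue}} - \mu_{\text{red}}| = \sigma .
```

Under this reading, a divergence of 0.5 corresponds to the two lines being one standard deviation
apart, and the value does not depend on the marginal density of the i.v. as long as both classes
share it.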

Figure 8.2a shows the results on the uneven sets. As expected, the joint divergences are
corrupted by the sampling bias and are far from the truth. The three CDs all converge to the true
value. Figure 8.2b shows the results on the partial sets. The joint divergence diverges in this case.
CD-P and CD-U are closer but do not converge to the correct value, because the non-overlapping
supports violate the assumptions they make. CD-PQ successfully recovers the true value. This shows
the advantage of only measuring CD within the overlapping region in this example. Overall, the CDs
are effective against sampling bias, and the estimators converge to the true values when their
assumptions are met.


Figure 8.2: Estimated divergences on the synthetic data. (a) Uneven sets. (b) Partial sets. The
horizontal axis of each panel is the i.v.


Handling  Point  Sets 

Here we test the estimators using a large number of point sets. The full data of the two classes
are shown in Figure 8.4a. To create partial sets, we use a sliding window, whose width is half of the
data’s span, to scan the full data, and at each position we put the points within the window together
as a set. The uneven sets are then created by combining the partial sets with a small number of
random samples from the original data. 100 sets are created for each class, and each set contains
200 to 300 points.
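
The sliding-window construction of partial and uneven sets can be sketched as follows. This is a
minimal illustration under assumptions: the i.v. is taken to be column 0, the toy data and the
number of extra samples (20) are hypothetical, and the helper names are ours.

```python
import numpy as np

def make_partial_set(points, start, width):
    """Keep the points whose i.v. (column 0) lies in [start, start + width]."""
    x = points[:, 0]
    return points[(x >= start) & (x <= start + width)]

def make_uneven_set(points, start, width, n_extra, rng):
    """Partial set plus a few points drawn at random from the full data.

    The extra draws may repeat points already inside the window."""
    partial = make_partial_set(points, start, width)
    extra = points[rng.choice(len(points), size=n_extra, replace=False)]
    return np.vstack([partial, extra])

# Toy full data for one class: column 0 is the i.v., column 1 the d.v.
rng = np.random.default_rng(0)
full = np.column_stack([rng.uniform(-0.5, 0.5, 1000), rng.normal(size=1000)])

lo, hi = full[:, 0].min(), full[:, 0].max()
width = 0.5 * (hi - lo)                    # window covers half of the span
starts = np.linspace(lo, hi - width, 100)  # 100 window positions
partial_sets = [make_partial_set(full, s, width) for s in starts]
uneven_sets = [make_uneven_set(full, s, width, 20, rng) for s in starts]
```

The partial sets have disjoint or only partly overlapping supports depending on the window
positions, while the extra random samples give every uneven set the full support.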

This data set is more challenging: the marginal distribution of the d.v. cannot differentiate the two
classes; the conditional distributions $f(y|x)$ depend on x; and near the center of the data the
conditional distributions of the two classes are very close. The different divergence matrices on
the uneven sets are shown in Figure 8.3, in which we sorted the sets according to their classes and
window positions to show the structures. We see that the joint divergence is severely affected by
the sampling bias, while the CDs are quite insensitive. The result of CD-U is especially impressive:
the similarity structure of the original data is perfectly recovered. Figure 8.4 shows the results on
the partial sets. The joint divergence is now dominated by the sampling bias. The CDs again are able
to recover from this severe disruption and achieve reasonable results. The result of CD-PQ is the
cleanest on this data set.


Figure  8.3:  Divergences  on  the  uneven  sets.  The  goal  is  to  recover  the  “Full  D”  given  only  the 
biased  sets. 


Figure 8.4: Divergences on the partial sets. (a) Original data. (b) Divergences (CD-P, CD-U,
CD-PQ). The goal is to recover the “Full D” result shown in Figure 8.3.


8.6.2  Season  Classification 

In this section we use the divergences in an SVM to classify real-world point sets generated by
sensor networks. We gathered the data from the QCLCD climate database at NCDC¹. We use a subset
of QCLCD that contains daily climatological data from May 2007 to May 2013 measured by
1,164 weather stations in the continental U.S. Each of these weather stations produces various
measurements such as the temperature, humidity, precipitation, etc., at its location. We aggregate
these data into point sets, so that each set contains the measurements from all stations in one week.
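
The weekly aggregation can be sketched as below. The record layout (date, latitude, longitude,
temperature), the use of ISO calendar weeks, and the sample values are assumptions made for
illustration; they do not describe the actual QCLCD file format or the exact week boundaries used.

```python
from collections import defaultdict
from datetime import date

def group_by_week(records):
    """Pool station measurements into one point set per calendar week.

    records: iterable of (date, latitude, longitude, temperature).
    Returns {(iso_year, iso_week): [(lat, lon, temp), ...]}."""
    sets = defaultdict(list)
    for day, lat, lon, temp in records:
        iso = day.isocalendar()              # (ISO year, ISO week, weekday)
        sets[(iso[0], iso[1])].append((lat, lon, temp))
    return dict(sets)

# Made-up records, for illustration only.
records = [
    (date(2007, 5, 7), 40.4, -79.9, 15.0),
    (date(2007, 5, 9), 34.0, -118.2, 20.5),   # same ISO week as the first
    (date(2007, 5, 14), 40.4, -79.9, 17.0),   # the following week
]
weekly_sets = group_by_week(records)
```

Each value of `weekly_sets` is then one 3-dimensional point set in the sense used in this section.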

We  consider  the  problem  of  predicting  the  season  of  a  set  based  on  the  average  temperature 
measurement.  Specifically,  we  want  to  know  if  a  set  corresponds  to  Spring  or  Fall  based  on  the 
average  temperatures  over  the  U.S.  Note  that  classifying  Summer  and  Winter  would  be  too  easy, 
while differentiating Spring and Fall can be challenging since they have similar average
temperatures. Nevertheless, it is still possible based on the geographical distribution of the temperatures.
Figure  8.5  shows  the  temperature  maps  in  a  first  week  of  March  and  a  first  week  of  November. 

¹ http://www.ncdc.noaa.gov

Again, we create uneven and partial sets from the original data by randomly positioning
a full-width window whose height is 20% of the data’s vertical span, as shown in Figure 8.5.
This injection of sampling bias simulates the scenario where we only have a sensing satellite
orbiting parallel to the equator. In this problem, the location is the i.v. and the temperature is the
d.v. This procedure gives us 160 3-dimensional (latitude, longitude, temperature) point sets with
sizes around 2,000.

Figure 8.5: Example temperature maps of the U.S. from the QCLCD. (a) and (c) are the original
data; (b) and (d) are the artificially created uneven data.


In each run, a random 20% of the point sets are used for training and the rest are used for testing.
Classification results of 10 runs are reported in Figure 8.6. On the uneven sets, we see that both
CD-U and CD-PQ are able to recover from the sampling bias and achieve results that are only 3% worse
than the full divergence. On the partial sets, however, the performance of CD-U dropped
significantly. This indicates that it can be risky to apply CD in regions where two sets do not
overlap. It is interesting to see that D-DV, which ignores the locations, barely does better than
random, since Spring and Fall indeed have similar temperatures. Yet by considering the geographical
distribution of temperatures we can achieve 70% accuracy.


8.6.3  Image  Classification 

We can also use CDs to classify scene images. We construct one point set for each image, where
each point describes one patch, including its location (i.v.) and its feature (d.v.). The OT [5] scene
images are used, which contain 2,688 grayscale images of size 256 × 256 from 8 categories. The
patches are sampled densely on a grid and multiscale SIFT features are extracted using VLFeat
[164]. The points are reduced to 20 dimensions using PCA, preserving 70% of the variance.
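
The PCA reduction step can be sketched as follows. This is a generic SVD-based PCA, assuming
nothing about the actual preprocessing beyond what is stated above; the function name `pca_reduce`
and the random stand-in for the SIFT descriptors are hypothetical.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project the rows of `features` onto the top principal components.

    Returns (reduced, fraction_of_variance_retained)."""
    centered = features - features.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2                                  # proportional to per-component variance
    reduced = centered @ Vt[:n_components].T
    return reduced, var[:n_components].sum() / var.sum()

# Random stand-in for 128-dimensional SIFT descriptors (not real image data).
rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 128)) * np.linspace(3.0, 0.1, 128)
reduced, retained = pca_reduce(feats, 20)
```

The returned fraction plays the role of the "70% of the variance" figure quoted above; the
principal directions would be fit once on training patches and reused for all images.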

Figure 8.6: Season classification results on the QCLCD weather data. (a) QCLCD, uneven.
(b) QCLCD, partial. Methods compared in each panel: Full D, D-DV, D, CD-P, CD-U, CD-PQ.


Again, we create both uneven and partial point sets by randomly positioning a full-width
window whose height is 60% of the image. By doing this, the injected sampling bias forces a set to
focus on a specific horizontal part of the scene. For instance, in a beach scene, the biased observer
focuses either on the sky or on the sand, and only sees a small part of the rest of the scene. After
the above processing, the full data set contains 2,688 sets of 20-dimensional points, and the sets’
sizes are around 1,600. In the biased data, each partial set has about 950 points and each uneven
set has about 1,100. In each run, we randomly select 50 images per class for training and another
50 for testing.

Results  of  10  random  runs  are  shown  in  Figure  8.7.  In  these  results,  CDs  again  successfully 
restore  the  accuracies  to  a  high  level  even  in  the  face  of  harsh  sampling  biases.  We  see  that  CD-U 
impressively  beats  the  other  methods  by  a  large  margin  on  the  uneven  sets,  and  is  only  1%  worse 
than  the  full  divergence.  CD-PQ  is  the  best  on  partial  sets.  These  results  show  the  CDs’  corrective 
power  when  the  correct  assumptions  are  made  about  the  sampling  biases. 

We also observe that CD-U and CD-P did not perform well on the partial sets, which is expected
since their assumptions were invalid on the data. In general, the impact of sampling bias on this
data set is small (less than a 10% decrease in accuracy) because the patch features (d.v.) only
weakly depend on the patch locations (i.v.). In fact, many patch-based image analyses such as [50]
do not include the locations. This might explain why both D-DV and D-P did reasonably well in
this task and the corrected results by CD-PQ are only slightly better.


Figure 8.7: Image classification results on OT. (a) Image, uneven. (b) Image, partial. Methods
compared in each panel: Full D, D-DV, D-P, CD-P, CD-U, CD-PQ.


8.7  Summary

In this chapter we described various aspects of dealing with sampling bias when learning from point
sets. We proposed the conditional divergence (CD) to measure the difference between point sets
and alleviate the impact of sampling bias. An efficient and convergent estimator of CD was
provided. We then discussed how to deal with various types of sampling biases using CD. In the
experiments we showed that these methods are effective against sampling bias on both synthetic and
real data.

Several directions can be explored in the future. We can extend the definition of conditional
divergence from the KL divergence to the more general Rényi divergences. The generalized conditional
divergences provide the possibility of learning the weights of the density ratios in a supervised
way in order to maximize the discriminative power of the resulting divergences. The distance
between points used in estimating the CDs could also be learned. Finally, for extreme cases that
cause missing divergences, we may infer them by exploiting the relationships among the sets using
the low-rank matrix completion techniques described in Chapter 6.


Chapter  9

Conclusion  and  Future  Directions 


Many real-world problems generate a large amount of collective data that are organized by groups.
To learn from them effectively, we need new tools from machine learning. How and what can we
learn from collective data? In this thesis, we describe the research we have conducted in answering
this question.

We considered different types of collective data and different ways to approach them. The first
type consists of groups of discrete points. These groups can naturally be reduced to vectors, and
we proposed two novel factorization models to learn from them. The first model, called Bayesian
probabilistic tensor factorization, is able to capture the temporal dynamics of the data, and uses
Bayesian techniques to avoid overfitting and parameter tuning. The second model, direct robust
matrix factorization, addresses the outliers in the data, and seeks to find robust factors/subspaces
as well as identify the outliers. Both of these methods are simple and efficient for practical use.

The  main  focus  of  this  thesis,  however,  is  on  the  more  commonly  encountered  collective  data: 
groups  of  real-valued  multidimensional  points.  We  developed  both  generative  and  discriminative 
methods to learn from them. From the generative perspective, we can first learn the generating
process  of  the  data  and  then  use  it  to  accomplish  various  learning  tasks.  Motivated  by  the  group 
anomaly  problem,  and  facilitated  by  the  topic  modeling  techniques,  we  developed  two  flexible 
genre  models  to  characterize  how  a  collective  data  set  was  generated.  We  further  designed  several 
scoring  functions  based  on  these  models  to  find  different  types  of  group  anomalies. 

We also took the kernel approach to learn from collective data discriminatively. Thanks to
the newly proposed non-parametric divergence estimators, we can derive a new class of consistent
and efficient kernel estimators for collective data. These kernels achieved state-of-the-art
performance in image classification tasks. Further efforts were made to study different ways of
constructing Mercer kernels from the raw divergences in order to exploit the information in the
divergence matrices.

We then addressed several practical problems in the kernel methods. The kernel estimators in
this work, though relatively efficient, are still slow in practice. To accelerate them, we studied
different ways of reducing the sizes of the groups, and discovered that k-means was able to condense
the information in the original groups into much smaller ones. As a result, the computation can be
orders of magnitude faster while the learning performance is preserved. The second practical
issue we considered was sampling bias. In the presence of sampling bias, the observed groups
of points are not representative of the groups’ underlying attributes. To solve this problem, we
improved the traditional divergences and proposed the conditional divergences, for which an efficient
estimator was also developed. Under certain assumptions, conditional divergences are insensitive
to common sampling biases in data.

The methods we proposed are widely applicable to many real-world problems. In this thesis our
attention is paid primarily to the scientific discovery process. We developed automatic discovery
and learning systems for data sets from astronomy and physics based on the research in this thesis.
These systems facilitate the collaboration between us and the scientists by presenting the learning
results to, and collecting feedback from, the experts. We believe this is the first step in building
more powerful systems in the future.

The majority of the research in this thesis depends on the assumption that the points in a group
are either exchangeable or i.i.d. samples from the underlying distribution. As of now this
assumption has been prevailing in various areas including text modeling and computer vision, and has
achieved great success. Nevertheless, in the future we would like to study the particularly
interesting case of structured groups, in which points depend on other points. For example, we can
consider the Markovian dependencies between words in the same document, or between patches
in the same image. With these additional characterizations of data, better results on learning from
collective data can be expected.

There are many other problems that remain to be studied based on this work. For example,
considering that real-world data sets almost always contain outliers, we wish to make our
methods robust so that the results are more reliable. This requires the development of robust
topic models and kernel estimators. We also want to expand the research to situations where the
points are functions. This is quite common in astronomy, where a spectrum is considered a noisy
observation of the object’s underlying characteristic spectral function within a certain wavelength
range. Finally, continuous effort has to be made in improving the algorithms’ speed to gain actual
practicality.

We believe that our current work is just a beginning and much remains to be done in the future.
Learning from collective data directly has been a less active topic in machine learning, probably
because it requires a large amount of computational resources and the mathematical representations
of the problems are less concise and elegant than those of point-wise learning. However, the vast
advancements of computer hardware and parallel computing tools have largely cleared the obstacles,
and it is interesting and useful to further explore this realm.